Abstract. A nonnegative martingale with initial value equal to one
measures evidence against a probabilistic hypothesis. The inverse of its 
value at some stopping time can be interpreted as a Bayes factor. If 
we exaggerate the evidence by considering the largest value attained so 
far by such a martingale, the exaggeration will be limited, and there 
are systematic ways to eliminate it. The inverse of the exaggerated 
value at some stopping time can be interpreted as a p-value. We give 
a simple characterization of all increasing functions that eliminate the 
exaggeration.

1. INTRODUCTION

Nonnegative martingales with initial value 1, 
Bayes factors and p-values can all be regarded as 
measures of evidence against a probabilistic hypoth- 
esis (i.e., a simple statistical hypothesis). In this ar- 
ticle we review the well-known relationship between 
Bayes factors and nonnegative martingales and the 
less well-known relationship between p-values and
the suprema of nonnegative martingales. Figure 1
provides a visual frame for the relationships we 
discuss. 

Consider a random process (Xt) that initially has 
the value one and is a nonnegative martingale un- 
der a probabilistic hypothesis P (the time t may be 
discrete or continuous). We call such a martingale 
a test martingale. One statistical interpretation of 
the values of a test martingale is that they measure 
the changing evidence against P. The value Xt is 
the number of dollars a gambler has at time t if 
he begins with $1 and follows a certain strategy for 
betting at the rates given by P; the nonnegativity of 
the martingale means that this strategy never risks 
a cumulative loss exceeding the $1 with which it be- 
gan. If Xt is very large, the gambler has made a lot 
of money betting against P, and this makes P look 
doubtful. But then $X_u$ for some later time $u$ may be
lower and make P look better. 
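
The betting interpretation can be made concrete with a small sketch. It assumes, purely for illustration, that P is the fair-coin hypothesis and that the gambler always stakes a fixed fraction of current capital on heads at even odds; the function names and the staking rule are hypothetical choices, not taken from the article.

```python
from fractions import Fraction
from itertools import product


def capital_path(outcomes, fraction=Fraction(1, 2)):
    """Capital X_0, X_1, ... of a gambler who starts with $1 and, before each
    toss, stakes `fraction` of the current capital on heads at even odds."""
    capital = [Fraction(1)]
    for heads in outcomes:
        stake = fraction * capital[-1]
        capital.append(capital[-1] + stake if heads else capital[-1] - stake)
    return capital


def expected_capital(n, fraction=Fraction(1, 2)):
    """Exact E(X_n) under P: all 2**n toss sequences are equally likely."""
    total = sum(capital_path(o, fraction)[-1]
                for o in product([True, False], repeat=n))
    return total / 2 ** n
```

Because the stake never exceeds the current capital, the capital stays nonnegative, and because the odds are fair under P, the expected capital stays at 1 for every n; a long run of heads makes the capital large and P look doubtful.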

The notion of a test martingale (Xt) is related to 
the notion of a Bayes factor, which is more familiar 
to statisticians. A Bayes factor measures the degree 
to which a fixed body of evidence supports P rela- 
tive to a particular alternative hypothesis Q; a very 
small value can be interpreted as discrediting P. If 
(Xt) is a test martingale, then for any fixed time t, 
$1/X_t$ is a Bayes factor. We can also say, more generally,
that the value $1/X_\tau$ for any stopping time $\tau$ is
a Bayes factor. This is represented by the downward 
arrow on the left in Figure 1.

Fig. 1. The relationship between a Bayes factor and a p-value can be thought of as a snapshot of the dynamic relationship
between a nonnegative martingale $(X_t)$ with initial value 1 and the process $(X_t^*)$ that tracks its supremum. The snapshot could
be taken at any time, but in our theorems we consider the final values of the martingale and its supremum process.



Suppose we exaggerate the evidence against P by
considering not the current value $X_t$ but the greatest
value so far:

$$X_t^* := \sup_{s \le t} X_s.$$

A high $X_t^*$ is not as impressive as a high $X_t$, but
how should we understand the difference? Here are 
two complementary answers: 

Answer 1 (Downward arrow on the right in Figure 1).
Although $(X_t^*)$ is usually not a martingale,
the final value $X_\infty^* := \sup_s X_s$ still has a property
associated with hypothesis testing: for every $\delta \in [0, 1]$,
$1/X_\infty^*$ has probability no more than $\delta$ of being $\delta$ or
less. For any t, $1/X_t^*$ has the same property, because $X_t^*$
is less than or equal to $X_\infty^*$. In this sense, $1/X_\infty^*$
and $1/X_t^*$ are p-values (perhaps conservative).

Answer 2 (Leftward arrow at the top of Figure 1).
As we will show, there are systematic ways
of shrinking $X_t^*$ (calibrating it, as we shall say) to
eliminate the exaggeration. There exist, that is to
say, functions $f$ such that $\lim_{x\to\infty} f(x) = \infty$ and
$f(X_t^*)$ is an unexaggerated measure of evidence
against P, inasmuch as there exists a test martingale
$(Y_t)$ always satisfying $Y_t \ge f(X_t^*)$ for all t.

Answer 2 will appeal most to readers familiar with 
the algorithmic theory of randomness, where the 
idea of treating a martingale as a dynamic measure 
of evidence is well established (see, e.g., [25], Sec- 
tion 4.5.7). Answer 1 may be more interesting to 
readers familiar with mathematical statistics, where 
the static notions of a Bayes factor and a p-value 
are often compared. 



For the sake of conceptual completeness, we note 
that Answer 1 has a converse. For any random variable
p that has probability $\delta$ of being $\delta$ or less for every
$\delta \in [0, 1]$, there exists a test martingale $(X_t)$ such
that $p = 1/X_\infty^*$. This converse is represented by the
upward arrow on the right of our figure. It may be 
of limited practical interest, because the time scale 
for (Xt) may be artificial. 

Parallel to the fact that we can shrink the running 
supremum of a test martingale to obtain an unex- 
aggerated test martingale is the fact that we can 
inflate a p-value to obtain an unexaggerated Bayes 
factor. This is the leftward arrow at the bottom of 
Figure 1. It was previously discussed in [41] and [35]. 

These relationships are probably all known in one 
form or another to many people. But they have re- 
ceived less attention than they deserve, probably be- 
cause the full picture emerges only when we bring to- 
gether ideas from algorithmic randomness and math- 
ematical statistics. Readers who are not familiar with 
both fields may find the historical discussion in Sec- 
tion 2 helpful. 

Although our theorems are not deep, we state and 
prove them using the full formalism of modern prob- 
ability theory. Readers more comfortable with the 
conventions and notation of mathematical statistics 
may want to turn first to Section 8, in which we 
apply these results to testing whether a coin is fair. 

The theorems depicted in Figure 1 are proven in 
Sections 3-7. Section 3 is devoted to mathemat- 
ical preliminaries; in particular, it introduces the 
concept of a test martingale and the wider and, in 
general, more conservative concept of a test supermartingale.
Section 4 reviews the relationship between
test supermartingales and Bayes factors, while
Section 5 explains the relationship between the suprema
of test supermartingales and p-values. Section 6
explains how p-values can be inflated so that
they are not exaggerated relative to Bayes factors, 
and Section 7 explains how the maximal value at- 
tained so far by a test supermartingale can be simi- 
larly shrunk so that it is not exaggerated relative to 
the current value of a test supermartingale. 

There are two appendices. Appendix A explains 
why test supermartingales are more efficient tools 
than test martingales in the case of continuous time. 
Appendix B carries out some calculations that are 
used in Section 8. 

2. SOME HISTORY 

Jean Ville introduced martingales into probability 
theory in his 1939 thesis [39]. Ville considered only 
test martingales and emphasized their betting inter- 
pretation. As we have explained, a test martingale 
under P is the capital process for a betting strat- 
egy that starts with a unit capital and bets at rates 
given by P, risking only the capital with which it 
begins. Such a strategy is an obvious way to test P: 
you refute the quality of P's probabilities by making 
money against them. 

As Ville pointed out, the event that a test martin- 
gale tends to infinity has probability zero, and for 
every event of probability zero, there is a test mar- 
tingale that tends to infinity if the event happens. 
Thus, the classical idea that a probabilistic theory 
predicts events to which it gives probability equal 
(or nearly equal) to one can be expressed by saying 
that it predicts that test martingales will not become
infinite (or very large). Ville's idea was popularized
after World War II by Per Martin-Löf [27, 28]
and subsequently developed by Claus-Peter Schnorr 
in the 1970s [34] and A. P. Dawid in the 1980s [11]. 
For details about the role of martingales in algorith- 
mic randomness from von Mises to Schnorr, see [8]. 
For historical perspective on the paradoxical behav- 
ior of martingales when they are not required to be 
nonnegative (or at least bounded below), see [9]. 

Ville's idea of a martingale was taken up as a tech- 
nical tool in probability mathematics by Joseph Doob 
in the 1940s [26], and it subsequently became important
as a technical tool in mathematical statistics,
especially in sequential analysis and time series [21] 
and in survival analysis [1]. Mathematical statistics 
has been slow, however, to take up the idea of a martingale
as a dynamic measure of evidence. Instead,
statisticians emphasize a static concept of hypothesis
testing.

Most literature on statistical testing remains in 
the static and all-or-nothing (reject or accept) frame- 
work established by Jerzy Neyman and Egon Pear- 
son in 1933 [31]. Neyman and Pearson emphasized 
that when using an observation y to test P with re- 
spect to an alternative hypothesis Q, it is optimal to 
reject P for values of y for which the likelihood ratio
$P(y)/Q(y)$ is smallest or, equivalently, for which
the reciprocal likelihood ratio $Q(y)/P(y)$ is largest.
[Here $P(y)$ and $Q(y)$ represent either probabilities
assigned to y by the two hypotheses or, more generally,
probability densities relative to a common reference
measure.] If the observation y is a vector, say,
$y_1, \ldots, y_t$, where t continues to grow, then the
reciprocal likelihood ratio $Q(y_1,\ldots,y_t)/P(y_1,\ldots,y_t)$
is a discrete-time martingale under P, but mathematical
statisticians did not propose to interpret
it directly. In the sequential analysis invented by
Abraham Wald and George A. Barnard in the 1940s,
the goal still is to define an all-or-nothing Neyman-Pearson
test satisfying certain optimality conditions,
although the reciprocal likelihood ratio plays an important
role [when testing P against Q, this goal
is attained by a rule that rejects P when
$Q(y_1,\ldots,y_t)/P(y_1,\ldots,y_t)$ becomes large enough and accepts
P when $Q(y_1,\ldots,y_t)/P(y_1,\ldots,y_t)$ becomes small
enough].
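
The martingale property of the reciprocal likelihood ratio can be checked directly in a toy case. The sketch below assumes independent binary observations, with P giving probability p and Q giving probability q to the outcome 1; the default values and function names are illustrative assumptions only.

```python
from fractions import Fraction


def reciprocal_likelihood_ratio(ys, q=Fraction(3, 4), p=Fraction(1, 2)):
    """Q(y_1..y_t)/P(y_1..y_t) for independent binary observations."""
    ratio = Fraction(1)
    for y in ys:
        ratio *= q / p if y == 1 else (1 - q) / (1 - p)
    return ratio


def next_step_mean_under_p(ys, q=Fraction(3, 4), p=Fraction(1, 2)):
    """Conditional expectation under P of the ratio after one more
    observation, given the observations ys seen so far."""
    ys = list(ys)
    return (p * reciprocal_likelihood_ratio(ys + [1], q, p)
            + (1 - p) * reciprocal_likelihood_ratio(ys + [0], q, p))
```

For any observed prefix, `next_step_mean_under_p` returns the current value of the ratio, which is the one-step martingale identity under P.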

The increasing importance of Bayesian philoso- 
phy and practice starting in the 1960s has made 
the likelihood ratio P(y)/Q(y) even more impor- 
tant. This ratio is now often called the Bayes fac- 
tor for P against Q, because by Bayes's theorem, 
we obtain the ratio of P's posterior probability to 
Q's posterior probability by multiplying the ratio of 
their prior probabilities by this factor [20]. 

The notion of a p-value developed informally in 
statistics. From Jacob Bernoulli onward, everyone 
who applied probability theory to statistical data 
agreed that one should fix a threshold (later called 
a significance level) for probabilities, below which a 
probability would be small enough to justify the re- 
jection of a hypothesis. But because different people 
might fix this threshold differently, it was natural, in 
empirical work, to report the smallest threshold for 
which the hypothesis would still have been rejected, 
and British statisticians (e.g., Karl Pearson in 1900 
[32] and R. A. Fisher in 1925 [16]) sometimes called 
this borderline probability "the value of P." Later, 
this became "P-value" or "p-value" [3].

After the work of Neyman and Pearson, which emphasized
the probabilities of error associated with
significance levels chosen in advance, mathematical
statisticians often criticized applied statisticians for
merely reporting p-values, as if a small p-value were
a measure of evidence, speaking for itself without 
reference to a particular significance level. This dis- 
dain for p-values has been adopted and amplified 
by modern Bayesians, who have pointed to cases 
where p-values diverge widely from Bayes factors 
and hence are very misleading from a Bayesian point 
of view [35, 43]. 

3. MATHEMATICAL PRELIMINARIES 

In this section we define martingales, Bayes fac- 
tors and p-values. All three notions have two ver- 
sions: a narrow version that requires an equality and 
a wider version that relaxes this equality to an in- 
equality and is considered conservative because the 
goal represented by the equality in the narrow ver- 
sion may be more than attained; the conservative 
versions are often technically more useful. The con- 
servative version of a martingale is a supermartin- 
gale. As for Bayes factors and p-values, their main 
definitions will be conservative, but we will also de- 
fine narrow versions. 

Recall that a probability space is a triplet $(\Omega, \mathcal{F}, P)$,
where $\Omega$ is a set, $\mathcal{F}$ is a σ-algebra on $\Omega$ and P is
a probability measure on $\mathcal{F}$. A random variable X
is a real-valued $\mathcal{F}$-measurable function on $\Omega$; we allow
random variables to take values $\pm\infty$. We use the
notation $E(X)$ for the integral of X with respect to
P and $E(X \mid \mathcal{G})$ for the conditional expectation of X
given a σ-algebra $\mathcal{G} \subseteq \mathcal{F}$; this notation is used only
when X is integrable [i.e., when $E(X^+) < \infty$ and
$E(X^-) < \infty$; in particular, $P\{X = \infty\} = P\{X = -\infty\} = 0$].
A random process is a family $(X_t)$ of
random variables $X_t$; the index t is interpreted as
time. We are mainly interested in discrete time (say,
$t = 0, 1, 2, \ldots$), but our results (Theorems 1-4) will
also apply to continuous time (say, $t \in [0,\infty)$).

3.1 Martingales and Supermartingales 

The time scale for a martingale or supermartin- 
gale is formalized by a filtration. In some cases, it 
is convenient to specify this filtration when intro- 
ducing the martingale or supermartingale; in others, 
it is convenient to specify the martingale or super- 
martingale and derive an appropriate filtration from 
it. So there are two standard definitions of martin- 
gales and supermartingales in a probability space. 
We will use them both: 



(1) $(X_t, \mathcal{F}_t)$, where t ranges over an ordered set
($\{0, 1, \ldots\}$ or $[0, \infty)$ in this article), is a supermartingale
if $(\mathcal{F}_t)$ is a filtration (i.e., an indexed set of
sub-σ-algebras of $\mathcal{F}$ such that $\mathcal{F}_s \subseteq \mathcal{F}_t$ whenever $s < t$),
$(X_t)$ is a random process adapted with respect to
$(\mathcal{F}_t)$ (i.e., each $X_t$ is $\mathcal{F}_t$-measurable), each $X_t$ is
integrable, and

$$E(X_t \mid \mathcal{F}_s) \le X_s \quad \text{a.s.}$$

when $s < t$. A supermartingale is a martingale if, for
all t and $s < t$,

$$E(X_t \mid \mathcal{F}_s) = X_s \quad \text{a.s.} \tag{1}$$

(2) A random process $(X_t)$ is a supermartingale
(resp. martingale) if $(X_t, \mathcal{F}_t)$ is a supermartingale
(resp. martingale), where $\mathcal{F}_t$ is the σ-algebra generated
by $X_s$, $s \le t$.

For both definitions, the class of supermartingales 
contains that of martingales. 

In the case of continuous time we will always assume
that the paths of $(X_t)$ are right-continuous
almost surely (they will then automatically have
left limits almost surely; see, e.g., [13], VI.3(2)). We
will also assume that the filtration $(\mathcal{F}_t)$ in $(X_t, \mathcal{F}_t)$
satisfies the usual conditions, namely, that each
σ-algebra $\mathcal{F}_t$ contains all subsets of all $E \in \mathcal{F}$ satisfying
$P(E) = 0$ (in particular, the probability space
is complete) and that $(\mathcal{F}_t)$ is right-continuous, in
that, at each time t, $\mathcal{F}_t = \mathcal{F}_{t+} := \bigcap_{s>t} \mathcal{F}_s$. If the
original filtration $(\mathcal{F}_t)$ does not satisfy the usual
conditions (this will often be the case when $\mathcal{F}_t$ is
the σ-algebra generated by $X_s$, $s \le t$), we can redefine
$\mathcal{F}$ as the P-completion $\mathcal{F}^P$ of $\mathcal{F}$ and redefine
$\mathcal{F}_t$ as $\mathcal{F}^P_{t+} := \bigcap_{s>t} \mathcal{F}^P_s$, where $\mathcal{F}^P_s$ is the σ-algebra
generated by $\mathcal{F}_s$ and the sets $E \in \mathcal{F}^P$ satisfying
$P(E) = 0$; $(X_t, \mathcal{F}_t)$ will remain a (super)martingale
by [13], VI.3(1).

We are particularly interested in test supermartingales,
defined as supermartingales that are nonnegative
($X_t \ge 0$ for all t) and satisfy $E(X_0) \le 1$, and test
martingales, defined as martingales that are nonnegative
and satisfy $E(X_0) = 1$. Earlier, we defined
test martingales as those having initial value 1; this
can be reconciled with the new definition by setting
$X_t := 1$ for $t < 0$. A well-known fact about test
supermartingales, first proven for discrete time and
test martingales by Ville, is that

$$P\{X_\infty^* \ge c\} \le 1/c \tag{2}$$

for every $c \ge 1$ ([39], page 100; [13], VI.1). We will
call this the maximal inequality. This inequality shows
that $X_t$ can take the value $\infty$ only with probability
zero.
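
As a sanity check of the maximal inequality (2), the following sketch enumerates every path of a simple discrete-time test martingale over a finite horizon. The martingale is an assumption made for illustration: a fair-coin capital process that multiplies by 1 + fraction or 1 - fraction with probability 1/2 each. Over a finite horizon n the running maximum $X_n^*$ plays the role of $X_\infty^*$ for the martingale stopped at time n.

```python
from fractions import Fraction
from itertools import product


def running_max_distribution(n, fraction=Fraction(1, 2)):
    """Exact distribution {value of X*_n: probability} of the running maximum
    of the capital process over n fair-coin tosses."""
    dist = {}
    for outcomes in product([True, False], repeat=n):
        x = best = Fraction(1)
        for heads in outcomes:
            x = x * (1 + fraction) if heads else x * (1 - fraction)
            best = max(best, x)
        dist[best] = dist.get(best, 0) + Fraction(1, 2 ** n)
    return dist


def prob_max_at_least(c, n, fraction=Fraction(1, 2)):
    """Exact P{X*_n >= c} under the fair-coin hypothesis."""
    dist = running_max_distribution(n, fraction)
    return sum(pr for value, pr in dist.items() if value >= c)
```

For every threshold $c \ge 1$, `prob_max_at_least(c, n)` should not exceed 1/c, in line with (2); exact rational arithmetic avoids rounding questions in the comparison.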
3.2 Bayes Factors

A nonnegative measurable function $B : \Omega \to [0, \infty]$
is called a Bayes factor for P if $\int (1/B)\,dP \le 1$; we
will usually omit "for P." A Bayes factor B is said
to be precise if $\int (1/B)\,dP = 1$.

In order to relate this definition to the notion
of Bayes factor discussed informally in Sections 1
and 2, we note first that whenever Q is a probability
measure on $(\Omega, \mathcal{F})$, the Radon-Nikodym derivative
$dQ/dP$ will satisfy $\int (dQ/dP)\,dP \le 1$, with equality
if Q is absolutely continuous with respect to P.
Therefore, $B = 1/(dQ/dP)$ will be a Bayes factor
for P. The Bayes factor B will be precise if Q is
absolutely continuous with respect to P; in this case B
will be a version of the Radon-Nikodym derivative
$dP/dQ$.

Conversely, whenever a nonnegative measurable
function B satisfies $\int (1/B)\,dP \le 1$, we can construct
a probability measure Q that has $1/B$ as its
Radon-Nikodym derivative with respect to P. We first
construct a measure $Q_0$ by setting $Q_0(A) := \int_A (1/B)\,dP$
for all $A \in \mathcal{F}$, and then obtain Q by adding to $Q_0$
a measure that puts the missing mass $1 - Q_0(\Omega)$
(which can be 0) on a set E (this can be empty or
a single point) to which P assigns probability zero.
(If P assigns positive probability to every element
of $\Omega$, we can add a new point to $\Omega$.) The function
B will be a version of the Radon-Nikodym derivative
$dP/dQ$ if we redefine it by setting $B(\omega) := 0$
for $\omega \in E$ [remember that $P(E) = 0$].
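
On a finite sample space the definition reduces to a weighted sum, which the following sketch illustrates. The dictionaries P and Q (mapping outcomes to probabilities) and the function names are hypothetical; the convention that B is infinite where Q vanishes is an assumption matching $1/B = dQ/dP = 0$ there.

```python
import math
from fractions import Fraction


def bayes_factor(P, Q):
    """Pointwise B = 1/(dQ/dP) on a finite space: B(w) = P(w)/Q(w), and
    B(w) = infinity where Q(w) = 0."""
    return {w: (math.inf if Q[w] == 0 else P[w] / Q[w]) for w in P}


def integral_of_reciprocal(B, P):
    """The integral of 1/B with respect to P; points of P-measure zero and
    points where B is infinite contribute nothing."""
    total = Fraction(0)
    for w, b in B.items():
        if P[w] == 0 or b == math.inf:
            continue
        total += P[w] / b
    return total
```

When Q puts mass on a P-null point the integral falls short of 1 (B is a Bayes factor but not precise); when Q is absolutely continuous with respect to P the integral equals 1.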

3.3 p-Values 

In order to relate p-values to supermartingales, we
introduce a new concept, that of a p-test. A p-test
is a measurable function $p : \Omega \to [0, 1]$ such that

$$P\{\omega \mid p(\omega) \le \delta\} \le \delta \tag{3}$$

for all $\delta \in [0, 1]$. We say that p is a precise p-test if

$$P\{\omega \mid p(\omega) \le \delta\} = \delta \tag{4}$$

for all $\delta \in [0, 1]$.

It is consistent with established usage to call the
values of a p-test p-values, at least if the p-test is
precise. One usually starts from a measurable function
$T : \Omega \to \mathbb{R}$ (the test statistic) and sets
$p(\omega) := P\{\omega' \mid T(\omega') \ge T(\omega)\}$; it is clear that a function p
defined in this way, and any majorant of such a p,
will satisfy (3). If the distribution of T is continuous,
p will also satisfy (4). If not, we can treat the
ties $T(\omega') = T(\omega)$ more carefully and set

$$p(\omega) := P\{\omega' \mid T(\omega') > T(\omega)\} + \xi P\{\omega' \mid T(\omega') = T(\omega)\},$$

where $\xi$ is chosen randomly from the uniform distribution
on [0, 1]; in this way we will always obtain
a function satisfying (4) (where P now refers to the
overall probability encompassing generation of $\xi$).
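
The non-randomized construction is easy to try out on a finite sample space. In this sketch P is a hypothetical dictionary of outcome probabilities and T a dictionary of test-statistic values; checking (3) only at the values that p actually takes is sufficient on a finite space, because $P\{p \le \delta\}$ is constant between consecutive attained values.

```python
from fractions import Fraction


def p_test(P, T):
    """p(w) = P{w' : T(w') >= T(w)} for a finite probability space P and
    test statistic T (both given as dicts keyed by outcome)."""
    return {w: sum(P[v] for v in P if T[v] >= T[w]) for w in P}


def satisfies_inequality_3(P, p):
    """Check P{p <= d} <= d at every value d attained by p."""
    return all(sum(P[w] for w in P if p[w] <= d) <= d
               for d in set(p.values()))
```

With ties in T the resulting p is a p-test but, in general, not a precise one, which is what motivates the randomized version.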

4. SUPERMARTINGALES AND BAYES 
FACTORS 

When $(X_t, \mathcal{F}_t)$ is a test supermartingale, $1/X_t$ is
a Bayes factor for any value of t. It is also true
that $1/X_\infty$, $X_\infty$ being the supermartingale's limiting
value, is a Bayes factor. Part 1 of the following
theorem is a precise statement of the latter assertion;
the former assertion follows from the fact that
we can stop the supermartingale at any time t.

Part 2 of Theorem 1 states that we can construct
a test martingale whose limiting value is reciprocal
to a given precise Bayes factor. We include this result
for mathematical completeness rather than because
of its practical importance; the construction
involves arbitrarily introducing a filtration, which
need not correspond to any time scale with practical
meaning. In its statement, we use $\mathcal{F}_\infty$ to denote
the σ-algebra generated by $\bigcup_t \mathcal{F}_t$.

Theorem 1. (1) If $(X_t, \mathcal{F}_t)$ is a test supermartingale,
then $X_\infty := \lim_{t\to\infty} X_t$ exists almost surely
and $1/X_\infty$ is a Bayes factor.

(2) Suppose B is a precise Bayes factor. Then
there is a test martingale $(X_t)$ such that $B = 1/X_\infty$
a.s. Moreover, for any filtration $(\mathcal{F}_t)$ such that B is
$\mathcal{F}_\infty$-measurable, there is a test martingale $(X_t, \mathcal{F}_t)$
such that $B = 1/X_\infty$ almost surely.

PROOF. If $(X_t, \mathcal{F}_t)$ is a test supermartingale, the
limit $X_\infty$ exists almost surely by Doob's convergence
theorem ([13], VI.6), and the inequality
$\int X_\infty\,dP \le 1$ holds by Fatou's lemma:

$$\int X_\infty\,dP = \int \liminf_{t\to\infty} X_t\,dP \le \liminf_{t\to\infty} \int X_t\,dP \le 1.$$

Now suppose that B is a precise Bayes factor and
$(\mathcal{F}_t)$ is a filtration (not necessarily satisfying the
usual conditions) such that B is $\mathcal{F}_\infty$-measurable;
for concreteness, we consider the case of continuous
time. Define a test martingale $(X_t, \mathcal{F}^P_{t+})$ by setting
$X_t := E(1/B \mid \mathcal{F}^P_{t+})$; versions of conditional
expectations can be chosen in such a way that $(X_t)$
is right-continuous: cf. [13], VI.4. Then $X_\infty = 1/B$
almost surely by Lévy's zero-one law ([24], pages
128-130; [30], VI.6, corollary). It remains to notice
that $(X_t, \mathcal{F}_t)$ will also be a test martingale. If $(\mathcal{F}_t)$
such that B is $\mathcal{F}_\infty$-measurable is not given in advance,
we can define it by, for example,

$$\mathcal{F}_t := \begin{cases} \{\emptyset, \Omega\}, & \text{if } t < 1, \\ \sigma(B), & \text{otherwise,} \end{cases}$$

where $\sigma(B)$ is the σ-algebra generated by B. □

Formally, a stopping time with respect to a filtration
$(\mathcal{F}_t)$ is a nonnegative random variable $\tau$ taking
values in $[0, \infty]$ such that, at each time t, the
event $\{\omega \mid \tau(\omega) \le t\}$ belongs to $\mathcal{F}_t$. Let $(X_t, \mathcal{F}_t)$ be
a test supermartingale. Doob's convergence theorem,
which was used in the proof of Theorem 1,
implies that we can define its value $X_\tau$ at $\tau$ by the
formula $X_\tau(\omega) := X_{\tau(\omega)}(\omega)$ even when $\tau = \infty$ with
positive probability. The stopped process
$(X_t^\tau, \mathcal{F}_t) := (X_{t \wedge \tau}, \mathcal{F}_t)$, where $a \wedge b := \min(a, b)$, will also be a test
supermartingale ([13], VI.12). Since $X_\tau$ is the final
value of the stopped process, it follows from part 1
of Theorem 1 that $1/X_\tau$ is a Bayes factor. (This also
follows directly from Doob's stopping theorem, [30],
VI.13.)
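
The claim that $1/X_\tau$ is a Bayes factor amounts to $\int X_\tau\,dP \le 1$, which can be verified exactly in a small discrete example. The sketch below assumes a fair-coin capital process that multiplies by 3/2 or 1/2, and a hypothetical stopping rule "stop as soon as the capital reaches a threshold, or at time n at the latest"; none of these choices comes from the article.

```python
from fractions import Fraction
from itertools import product


def stopped_value(outcomes, threshold):
    """X at the stopping time: the first time the capital is at least
    `threshold`, or the end of the toss sequence if that never happens."""
    x = Fraction(1)
    for heads in outcomes:
        if x >= threshold:
            break  # the decision to stop uses only the past, as required
        x = x * Fraction(3, 2) if heads else x * Fraction(1, 2)
    return x


def expected_stopped_value(n, threshold):
    """Exact E(X_tau) under the fair-coin hypothesis, horizon n."""
    total = sum(stopped_value(o, threshold)
                for o in product([True, False], repeat=n))
    return total / 2 ** n
```

For this bounded stopping time and this martingale the expectation is exactly 1, so $1/X_\tau$ is here a precise Bayes factor.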

5. SUPERMARTINGALES AND p-VALUES

Now we will prove that the inverse of a supremum 
of a test supermartingale is a p-test. This is true 
when the supremum is taken over [0, t] for some time 
point t or over $[0, \tau]$ for any stopping time $\tau$, but the
strongest way of making the point is to consider the
supremum over all time points (i.e., for $\tau := \infty$).

We will also show how to construct a test mar- 
tingale that has the inverse of a given p-test as its 
supremum. Because the time scale for this martin- 
gale is artificial, the value of the construction is more 
mathematical than directly practical; it will help us 
prove Theorem 4 in Section 7. But it may be worth- 
while to give an intuitive explanation of the con- 
struction. This is easiest when the p-test has discrete 
levels, because then we merely construct a sequence 
of bets. Consider a p-test p that is equal to 1 with 
probability 1/2, to 1/2 with probability 1/4, to 1/4 
with probability 1/8, etc.: 

$$P\{p = 2^{-n}\} = 2^{-n-1}$$

for $n = 0, 1, \ldots$. To see that a function on $\Omega$ that
takes these values with these probabilities is a p-test,
notice that when $2^{-n} \le \delta < 2^{-n+1}$,

$$P\{p \le \delta\} = P\{p \le 2^{-n}\} = 2^{-n} \le \delta.$$

Suppose that we learn first whether p is 1. Then, if
it is not 1, we learn whether it is 1/2. Then, if it
is not 1/2, whether it is 1/4, etc. To create the test
martingale $X_0, X_1, \ldots$, we start with capital $X_0 = 1$
and bet it all against p being 1. If we lose, $X_1 = 0$
and we stop. If we win, $X_1 = 2$, and we bet it all
against p being 1/2, etc. Each time we have even
chances of doubling our money or losing it all. If
$p = 2^{-n}$, then our last bet will be against $p = 2^{-n}$,
and the amount we will lose, $2^n$, will be $X_\infty^*$. So
$1/X_\infty^* = p$, as desired.
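
The sequence of bets just described can be traced in a few lines. This is only a sketch of the discrete example above; the function names are hypothetical.

```python
from fractions import Fraction


def doubling_path(n):
    """Capital X_0, ..., X_{n+1} on the event p = 2**-n: n winning bets that
    double the capital, followed by the losing bet against p = 2**-n."""
    capital = [Fraction(1)]
    for _ in range(n):
        capital.append(2 * capital[-1])
    capital.append(Fraction(0))
    return capital


def conditional_loss_probability(k):
    """P{p = 2**-k | p <= 2**-k}: the chance of losing the (k+1)-th bet,
    using P{p = 2**-k} = 2**-(k+1) and P{p <= 2**-k} = 2**-k."""
    return Fraction(1, 2 ** (k + 1)) / Fraction(1, 2 ** k)
```

The reciprocal of the largest capital along `doubling_path(n)` is $2^{-n}$, and each bet is lost with conditional probability 1/2, which is what makes doubling-or-nothing a fair bet.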
Here is our formal result: 

Theorem 2. (1) If $(X_t, \mathcal{F}_t)$ is a test supermartingale,
$1/X_\infty^*$ is a p-test.

(2) If p is a precise p-test, there is a test martingale
$(X_t)$ such that $p = 1/X_\infty^*$.

Proof. The inequality $P\{1/X_\infty^* \le \delta\} \le \delta$
for test supermartingales follows from the maximal
inequality (2).

In the opposite direction, let p be a precise p-test.
Set $\Pi := 1/p$; this function takes values in $[1, \infty]$.
Define a right-continuous random process $(X_t)$, $t \in [0, \infty)$, by

$$X_t(\omega) = \begin{cases} 1, & \text{if } t \in [0, 1), \\ t, & \text{if } t \in [1, \Pi(\omega)), \\ 0, & \text{otherwise.} \end{cases}$$

Since $X_\infty^* = \Pi$, it suffices to check that $(X_t)$ is a test
martingale. The time interval where this process is
nontrivial is $t \ge 1$; notice that $X_1 = 1$ with probability
one.

Let $t \ge 1$; we then have $X_t = t\,\mathbb{I}_{\{\Pi > t\}}$. Since $X_t$
takes values in the two-element set $\{0, t\}$, it is integrable.
The σ-algebra generated by $X_t$ consists of 4
elements ($\emptyset$, $\Omega$, the set $\Pi^{-1}((t, \infty])$, and its complement),
and the σ-algebra $\mathcal{F}_t$ generated by $X_s$, $s \le t$,
consists of the sets $\Pi^{-1}(E)$ where E is either a Borel
subset of $[1, t]$ or the union of $(t, \infty]$ and a Borel subset
of $[1, t]$. To check (1), where $1 \le s < t$, it suffices
to show that

$$\int_{\Pi^{-1}(E)} X_t\,dP = \int_{\Pi^{-1}(E)} X_s\,dP,$$

that is,

$$\int_{\Pi^{-1}(E)} t\,\mathbb{I}_{\{\Pi > t\}}\,dP = \int_{\Pi^{-1}(E)} s\,\mathbb{I}_{\{\Pi > s\}}\,dP, \tag{5}$$

where E is either a Borel subset of $[1, s]$ or the union
of $(s, \infty]$ and a Borel subset of $[1, s]$. If E is a Borel
subset of $[1, s]$, the equality (5) holds, as its two sides
are zero. If E is the union of $(s, \infty]$ and a Borel
subset of $[1, s]$, (5) can be rewritten as

$$\int_{\Pi^{-1}((s,\infty])} t\,\mathbb{I}_{\{\Pi > t\}}\,dP = \int_{\Pi^{-1}((s,\infty])} s\,\mathbb{I}_{\{\Pi > s\}}\,dP,$$

that is, $t\,P\{\Pi > t\} = s\,P\{\Pi > s\}$, that is, $1 = 1$. □
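
The final identity of the proof can be unpacked as follows. This is only a spelled-out version of the last step, using the assumption that p is precise, so that $P\{p \le \delta\} = \delta$ for every $\delta \in [0,1]$ and hence $P\{p = \delta\} = 0$.

```latex
\[
  t\,P\{\Pi > t\}
  = t\,P\{p < 1/t\}
  = t\,\bigl(P\{p \le 1/t\} - P\{p = 1/t\}\bigr)
  = t \cdot \frac{1}{t}
  = 1,
  \qquad t \ge 1 .
\]
```

The same computation with s in place of t gives $s\,P\{\Pi > s\} = 1$, so both sides of the rewritten equality agree.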

6. CALIBRATING p-VALUES 

An increasing (not necessarily strictly increasing)
function $f : [0, 1] \to [0, \infty]$ is called a calibrator if
$f(p)$ is a Bayes factor for any p-test p. This notion
was discussed in [41] and, less explicitly, in [35]. In
this section we will characterize the set of all increasing
functions that are calibrators; this result is
a slightly more precise version of Theorem 7 in [41].

We say that a calibrator f dominates a calibrator
g if $f(x) \le g(x)$ for all $x \in [0, 1]$. We say that
f strictly dominates g if f dominates g and
$f(x) < g(x)$ for some $x \in [0, 1]$. A calibrator is admissible if
it is not strictly dominated by any other calibrator.

Theorem 3. (1) An increasing function
$f : [0, 1] \to [0, \infty]$ is a calibrator if and only if

$$\int_0^1 \frac{dx}{f(x)} \le 1. \tag{6}$$

(2) Any calibrator is dominated by an admissible
calibrator.

(3) A calibrator is admissible if and only if it is
left-continuous and

$$\int_0^1 \frac{dx}{f(x)} = 1. \tag{7}$$



Proof. Part 1 is proven in [41] (Theorem 7),
but we will give another argument, perhaps more
intuitive. The condition "only if" is obvious: every
calibrator must satisfy (6) in order to transform
the "exemplary" p-test $p(\omega) = \omega$ on the probability
space $([0, 1], \mathcal{F}, P)$, where $\mathcal{F}$ is the Borel σ-algebra
on [0, 1] and P is the uniform probability measure
on $\mathcal{F}$, into a Bayes factor. To check "if," suppose
(6) holds and take any p-test p. The expectation
$E(1/f(p))$ depends on p only via the values
$P\{p \le c\}$, $c \in [0, 1]$, and this dependence is monotonic: if
a p-test $p_1$ is stochastically smaller than another
p-test $p_2$ in the sense that $P\{p_1 \le c\} \ge P\{p_2 \le c\}$
for all c, then $E(1/f(p_1)) \ge E(1/f(p_2))$. This can
be seen, for example, from the well-known formula
$E(\xi) = \int_0^\infty P\{\xi > c\}\,dc$, where $\xi$ is a nonnegative
random variable:

$$E(1/f(p_1)) = \int_0^\infty P\{1/f(p_1) > c\}\,dc \ge \int_0^\infty P\{1/f(p_2) > c\}\,dc = E(1/f(p_2)).$$

The condition (6) means that the inequality
$E(1/f(p)) \le 1$ holds for our exemplary p-test p; since
p is stochastically smaller than any other p-test, this
inequality holds for any p-test.

Part 3 follows from part 1, and part 2 follows from
parts 1 and 3. □

Equation (7) gives a recipe for producing admissible
calibrators f: take any left-continuous decreasing
function $g : [0, 1] \to [0, \infty]$ such that $\int_0^1 g(x)\,dx = 1$
and set $f(x) := 1/g(x)$, $x \in [0, 1]$. We see in this way,
for example, that

$$f(x) := x^{1-\alpha}/\alpha \tag{8}$$

is an admissible calibrator for every $\alpha \in (0, 1)$; if
we are primarily interested in the behavior of $f(x)$
as $x \to 0$, we should take a small value of $\alpha$. This
class of calibrators was found independently in [41]
and [35].
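
A short numerical sketch shows what the calibrator (8) does to a p-value. The grid-valued p-test used in the check (values k/n, each with probability 1/n) is an assumption introduced here for illustration; it satisfies (3) because $P\{p \le \delta\} = \lfloor n\delta \rfloor / n \le \delta$.

```python
def calibrate(p, alpha):
    """The calibrator (8): f(p) = p**(1 - alpha) / alpha, for 0 < alpha < 1."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return p ** (1 - alpha) / alpha


def mean_reciprocal_on_grid(alpha, n):
    """E(1/f(p)) for the p-test taking the values k/n, k = 1..n, each with
    probability 1/n; a Bayes factor requires this to be at most 1."""
    return sum(1.0 / calibrate(k / n, alpha) for k in range(1, n + 1)) / n
```

For instance, with $\alpha = 1/2$ a p-value of 0.01 is inflated to a Bayes factor of about 0.2, a much less dramatic figure. The grid mean stays below 1 because it is a right-endpoint Riemann sum of the decreasing integrand $\alpha x^{\alpha-1}$, whose integral over [0, 1] is 1.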

The calibrators (8) shrink to 0 significantly slower
than x as $x \to 0$. But there are evidently calibrators
that shrink as fast as $x \ln^{1+\alpha}(1/x)$, or
$x \ln(1/x) \ln^{1+\alpha} \ln(1/x)$, etc., where $\alpha$ is a positive constant.
For example,

$$f(x) := \begin{cases} \alpha^{-1} (1+\alpha)^{-\alpha}\, x \ln^{1+\alpha}(1/x), & \text{if } x \le e^{-1-\alpha}, \\ \infty, & \text{otherwise,} \end{cases} \tag{9}$$

is an admissible calibrator for any $\alpha > 0$.
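
That (9) satisfies the normalization (7) can be checked with the substitution $u = \ln(1/x)$. The computation below is a worked verification added for the reader; the region $x > e^{-1-\alpha}$ contributes nothing because $1/f(x) = 0$ there.

```latex
\begin{align*}
\int_0^1 \frac{dx}{f(x)}
  &= \alpha (1+\alpha)^{\alpha} \int_0^{e^{-1-\alpha}} \frac{dx}{x \ln^{1+\alpha}(1/x)} \\
  &= \alpha (1+\alpha)^{\alpha} \int_{1+\alpha}^{\infty} u^{-1-\alpha}\,du
     \qquad \bigl(u = \ln(1/x),\ du = -dx/x\bigr) \\
  &= \alpha (1+\alpha)^{\alpha} \cdot \frac{(1+\alpha)^{-\alpha}}{\alpha}
   = 1 .
\end{align*}
```

The cutoff $e^{-1-\alpha}$ is also where $x \ln^{1+\alpha}(1/x)$ stops increasing, since its derivative $\ln^{\alpha}(1/x)\,(\ln(1/x) - (1+\alpha))$ is nonnegative exactly when $x \le e^{-1-\alpha}$.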

7. CALIBRATING THE RUNNING SUPREMA
OF TEST SUPERMARTINGALES

Let us call an increasing function $f : [1, \infty) \to [0, \infty)$
a martingale calibrator if it satisfies the following
property:

For any probability space $(\Omega, \mathcal{F}, P)$ and
any test supermartingale $(X_t, \mathcal{F}_t)$ in this
probability space there exists a test supermartingale
$(Y_t, \mathcal{F}_t)$ such that $Y_t \ge f(X_t^*)$
for all t almost surely.

There are at least 32 equivalent definitions of a martingale
calibrator: we can independently replace each
of the two entries of "supermartingale" in the definition
by "martingale," we can independently replace
$(X_t, \mathcal{F}_t)$ by $(X_t)$ and $(Y_t, \mathcal{F}_t)$ by $(Y_t)$, and we can
optionally allow t to take value $\infty$. The equivalence
will be demonstrated in the proof of Theorem 4.
Our convention is that $f(\infty) := \lim_{x\to\infty} f(x)$ (but
remember that $X_t^* = \infty$ only with probability zero,
even for $t = \infty$).

As in the case of calibrators, we say that a martingale
calibrator f is admissible if there is no other
martingale calibrator g such that $g(x) \ge f(x)$ for all
$x \in [1, \infty)$ (g dominates f) and $g(x) > f(x)$ for some
$x \in [1, \infty)$.

Theorem 4. (1) An increasing function
$f : [1, \infty) \to [0, \infty)$ is a martingale calibrator if and only
if

$$\int_0^1 f(1/x)\,dx \le 1. \tag{10}$$

(2) Any martingale calibrator is dominated by an
admissible martingale calibrator.

(3) A martingale calibrator is admissible if and
only if it is right-continuous and

$$\int_0^1 f(1/x)\,dx = 1. \tag{11}$$



Proof. We start from the statement "if" of
part 1. Suppose an increasing function
$f : [1, \infty) \to [0, \infty)$ satisfies (10) and $(X_t, \mathcal{F}_t)$ is a test
supermartingale. By Theorem 3, $g(x) := 1/f(1/x)$,
$x \in [0, 1]$, is a calibrator, and by Theorem 2, $1/X_\infty^*$ is
a p-test. Therefore, $g(1/X_\infty^*) = 1/f(X_\infty^*)$ is a Bayes
factor, that is, $E(f(X_\infty^*)) \le 1$. Similarly to the proof
of Theorem 1, we set $Y_t := E(f(X_\infty^*) \mid \mathcal{F}_t)$, obtaining
a nonnegative martingale $(Y_t, \mathcal{F}_t)$ satisfying
$Y_\infty = f(X_\infty^*)$ a.s. We have $E(Y_0) \le 1$; the case $E(Y_0) = 0$
is trivial, and so we assume $E(Y_0) > 0$. Since

$$Y_t = E(f(X_\infty^*) \mid \mathcal{F}_t) \ge E(f(X_t^*) \mid \mathcal{F}_t) = f(X_t^*) \quad \text{a.s.}$$

(the case $t = \infty$ was considered separately) and we
can make $(Y_t, \mathcal{F}_t)$ a test martingale by dividing each
$Y_t$ by $E(Y_0) \in (0, 1]$, the statement "if" in part 1
of the theorem is proven. Notice that our argument
shows that f is a martingale calibrator in any of
the 32 senses; this uses the fact that $(Y_t)$ is a test
(super)martingale whenever $(Y_t, \mathcal{F}_t)$ is a test
(super)martingale.

Let us now check that any martingale calibrator
(in any of the senses) satisfies (10). By any of
our definitions of a martingale calibrator, we have
$\int f(X_t^*)\,dP \le 1$ for all test martingales $(X_t)$ and all
$t < \infty$. It is easy to see that in Theorem 2, part 2,
we can replace $X_\infty^*$ with, say, $X_{\pi/2}^*$, by replacing the
test martingale $(X_t)$ whose existence it asserts with

$$X'_t := \begin{cases} X_{\tan t}, & \text{if } t < \pi/2, \\ X_\infty, & \text{otherwise.} \end{cases}$$

Applying this modification of Theorem 2, part 2, to
the precise p-test $p(\omega) := \omega$ on [0, 1] equipped with
the uniform probability measure, we obtain

$$1 \ge \int f(X_{\pi/2}^*)\,dP = \int f(1/p)\,dP = \int_0^1 f(1/x)\,dx.$$

This completes the proof of part 1.

Part 3 is now obvious, and part 2 follows from
parts 1 and 3. □
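
The construction in the "if" part of the proof can be imitated on a finite binary tree. The sketch below makes several assumptions for illustration: discrete time with a finite horizon N (so that $X_N^*$ stands in for $X_\infty^*$ of the martingale stopped at N), a fair-coin capital process that multiplies by 1.5 or 0.5, and the increasing function $f(y) = \alpha y^{1-\alpha}$ with $\alpha = 1/2$, which satisfies (10) since $\int_0^1 \alpha x^{\alpha-1}\,dx = 1$. Floating-point arithmetic is used, with a small tolerance in the comparison.

```python
from itertools import product


def martingale_calibrator(y, alpha=0.5):
    """f(y) = alpha * y**(1 - alpha), with 0 < alpha < 1."""
    return alpha * y ** (1 - alpha)


def conditional_mean(x, best, steps_left, f, up=1.5, down=0.5):
    """Y_t = E(f(X*_N) | F_t), where x is X_t, best is X*_t, and the capital
    multiplies by `up` or `down` with probability 1/2 each (up + down = 2)."""
    if steps_left == 0:
        return f(best)
    hi = conditional_mean(x * up, max(best, x * up), steps_left - 1, f, up, down)
    lo = conditional_mean(x * down, max(best, x * down), steps_left - 1, f, up, down)
    return 0.5 * (hi + lo)


def check_construction(horizon, f=martingale_calibrator):
    """Return (E(Y_0), ok); ok is True if Y_t >= f(X*_t) at every time on
    every path of length `horizon`."""
    ok = True
    for outcomes in product([True, False], repeat=horizon):
        x = best = 1.0
        for t in range(horizon + 1):
            y_t = conditional_mean(x, best, horizon - t, f)
            ok = ok and y_t >= f(best) - 1e-12
            if t < horizon:
                x *= 1.5 if outcomes[t] else 0.5
                best = max(best, x)
    return conditional_mean(1.0, 1.0, horizon, f), ok
```

Here $(Y_t)$ is a martingale by construction, being a conditional expectation of a fixed final value; its initial value is at most 1, and dividing by that initial value, as in the proof, turns it into a test martingale that still dominates $f(X_t^*)$.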

As in the case of calibrators, we have a recipe for producing admissible martingale calibrators $f$ provided by (11): take any left-continuous decreasing function $g: [0,1] \to [0,\infty)$ satisfying $\int_0^1 g(x)\,dx = 1$ and set $f(y) := g(1/y)$, $y \in [1,\infty)$. In this way we obtain the class of admissible martingale calibrators

\[ f(y) := \alpha y^{1-\alpha}, \quad \alpha \in (0,1), \tag{12} \]

analogous to (8) and the class

\[ f(y) := \begin{cases} \alpha(1+\alpha)^{\alpha} \dfrac{y}{\ln^{1+\alpha} y}, & \text{if } y \ge e^{1+\alpha}, \\ 0, & \text{otherwise,} \end{cases} \quad \alpha > 0, \]

analogous to (9).
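The normalization (11) for the power class (12) can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names `power_calibrator` and `calibrator_integral` are ours, and the integral is approximated by a truncated midpoint rule after the substitution $x = e^{-s}$, which turns $\int_0^1 f(1/x)\,dx$ into $\int_0^\infty f(e^s)e^{-s}\,ds$.

```python
import math

def power_calibrator(alpha):
    """Return f(y) = alpha * y**(1 - alpha), the class (12); alpha must lie in (0, 1)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return lambda y: alpha * y ** (1.0 - alpha)

def calibrator_integral(f, upper=400.0, step=0.01):
    """Approximate int_0^1 f(1/x) dx as int_0^upper f(exp(s)) * exp(-s) ds.

    The truncation point and step size are illustrative choices that work
    for the power class; heavier-tailed calibrators would need more care.
    """
    n = int(round(upper / step))
    total = 0.0
    for i in range(n):
        s = (i + 0.5) * step          # midpoint of the i-th cell
        total += f(math.exp(s)) * math.exp(-s)
    return total * step
```

For example, `calibrator_integral(power_calibrator(0.1))` should be close to 1, while doubling the calibrator doubles the integral and so violates (10).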

In the case of discrete time, Theorem 4 has been 
greatly generalized by Dawid et al. ([12], Theorem 1). 
The generalization, which required new proof tech- 
niques, makes it possible to apply the result in new 
fields, such as mathematical finance ([12], Section 4). 

In this article we have considered only tests of sim- 
ple statistical hypotheses. We can use similar ideas 
for testing composite hypotheses, that is, sets of 
probability measures. One possibility is to measure 
the evidence against the composite hypothesis by 
the current value of a random process that is a test 
supermartingale under all probability measures in 
the composite hypothesis; we will call such processes 
simultaneous test supermartingales. For example, 
there are nontrivial processes that are test supermartingales under all exchangeable probability measures simultaneously ([42], Section 7.1). Will martingale calibrators achieve their goal for simultaneous test supermartingales? The method of proof of Theorem 4 does not work in this situation: in general, it will produce a different test supermartingale for each probability measure. The advantage of the method used in [12] is that it will produce one process, thus demonstrating that for each martingale calibrator $f$ and each simultaneous test supermartingale $X_t$ there exists a simultaneous test supermartingale $Y_t$ such that $Y_t \ge f(X^*_t)$ for all $t$ (the method of [12] works pathwise and makes the qualification "almost surely" superfluous).

8. EXAMPLES 

Although our results are very general, we can illustrate them using the simple problem of testing whether a coin is fair. Formally, suppose we observe a sequence of independent identically distributed binary random variables $x_1, x_2, \ldots$, each taking values in the set $\{0,1\}$; the probability $\theta \in [0,1]$ of $x_1 = 1$ is unknown. Let $P_\theta$ be the probability distribution of $x_1, x_2, \ldots$; it is a probability measure on $\{0,1\}^\infty$. In most of this section, our null hypothesis is that $\theta = 1/2$.

We consider both Bayesian testing of $\theta = 1/2$, where the output is a posterior distribution, and non-Bayesian testing, where the output is a p-value. We call the approach that produces p-values the sampling-theory approach rather than the frequentist approach, because it does not require us to interpret all probabilities as frequencies; instead, we can merely interpret the p-values using Cournot's principle ([36], Section 2). We have borrowed the term "sampling-theory" from D. R. Cox and A. P. Dempster [10, 14], without necessarily using it in exactly the same way as either of them does.

We consider two tests of $\theta = 1/2$, corresponding to two different alternative hypotheses:

(1) First, we test $\theta = 1/2$ against $\theta = 3/4$. This is unrealistic on its face; it is hard to imagine accepting a model that contains only these two simple hypotheses. But some of what we learn from this test will carry over to sensible and widely used tests of a simple against a composite hypothesis.

(2) Second, we test $\theta = 1/2$ against the composite hypothesis $\theta \ne 1/2$. In the spirit of Bayesian statistics and following Laplace ([22]; see also [38], Section 870, and [37]), we represent this composite hypothesis by the uniform distribution on $[0,1]$, the range of possible values for $\theta$. (In general, the composite hypotheses of this section will be composite only in the sense of Bayesian statistics; from the point of view of the sampling-theory approach, these are still simple hypotheses.)

For each test, we give an example of calibration of the running supremum of the likelihood ratio. In the case of the composite alternative hypothesis, we also discuss the implications of using the inverse of the running supremum of the likelihood ratio as a p-value.

To round out the picture, we also discuss Bayesian testing of the composite hypothesis $\theta \le 1/2$ against the composite hypothesis $\theta > 1/2$, representing the former by the uniform distribution on $[0, 1/2]$ and the latter by the uniform distribution on $(1/2, 1]$. Then, to conclude, we discuss the relevance of the calibration of running suprema to Bayesian philosophy.

Because the idea of tracking the supremum of 
a martingale is related to the idea of waiting un- 
til it reaches a high value, our discussion is related 
to a long-standing debate about "sampling to reach 
a foregone conclusion," that is, continuing to sam- 
ple in search of evidence against a hypothesis and 
stopping only when some conventional p-value fi- 
nally dips below a conventional level such as 5%. 
This debate goes back at least to the work of Fran- 
cis Anscombe in 1954 [4]. In 1961, Peter Armitage 
described situations where even a Bayesian can sam- 
ple to a foregone conclusion ([6]; [7], Section 5.1.4). 
Yet in 1963 [15], Ward Edwards and his co-authors 
insisted that this is not a problem: "The likelihood 
principle emphasized in Bayesian statistics implies, 
among other things, that the rules governing when 
data collection stops are irrelevant to data interpre- 
tation. It is entirely appropriate to collect data until 
a point has been proven or disproven, or until the 
data collector runs out of time, money, or patience." 
For further information on this debate, see [43]. We 
will not attempt to analyze it thoroughly, but our 
examples may be considered a contribution to it. 

8.1 Testing $\theta = 1/2$ Against a Simple Alternative

To test our null hypothesis $\theta = 1/2$ against the alternative hypothesis $\theta = 3/4$, we use the likelihood ratio

\[ X_t := \frac{P_{3/4}(x_1,\ldots,x_t)}{P_{1/2}(x_1,\ldots,x_t)} = \frac{(3/4)^{k_t}(1/4)^{t-k_t}}{(1/2)^t} = \frac{3^{k_t}}{2^t}, \tag{13} \]

where $k_t$ is the number of 1s in $x_1,\ldots,x_t$ [and $P_\theta(x_1,\ldots,x_t)$ is the probability under $P_\theta$ that the first $t$ observations are $x_1,\ldots,x_t$; such informal notation was already used in Section 2]. The sequence of successive values of this likelihood ratio is a test martingale $(X_t)$.
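As a small illustration of (13), the sketch below evaluates the likelihood ratio exactly and checks the one-step martingale property under $\theta = 1/2$ by averaging over the two equally likely next outcomes. The function names are ours and are introduced only for this example.

```python
from fractions import Fraction

def likelihood_ratio(xs):
    """X_t = 3**k_t / 2**t for a sequence xs of 0s and 1s, as in (13)."""
    k = sum(xs)                      # k_t, the number of 1s observed so far
    return Fraction(3 ** k, 2 ** len(xs))

def one_step_mean(xs):
    """E(X_{t+1} | x_1..x_t) under theta = 1/2: average over the next outcome."""
    xs = list(xs)
    return (likelihood_ratio(xs + [0]) + likelihood_ratio(xs + [1])) / 2
```

For any observed prefix, `one_step_mean(xs)` equals `likelihood_ratio(xs)`, because $(1/2)(3/2) + (1/2)(1/2) = 1$.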

According to (12), the function

\[ f(y) := 0.1\,y^{0.9} \tag{14} \]

is a martingale calibrator. So there exists a test martingale $(Y_t)$ such that

\[ Y_t \ge \max_{n=1,\ldots,t} 0.1\,X_n^{0.9}. \tag{15} \]

Figure 2 shows an example in which the martingale calibrator (14) preserves a reasonable amount of the evidence against $\theta = 1/2$. To construct this figure, we generated a sequence $x_1,\ldots,x_{10,000}$ of 0s and 1s, choosing each $x_t$ independently with the probability $\theta$ for $x_t = 1$ always equal to $\ln 2/\ln 3 \approx 0.63$. Then we formed the lines in the figure as follows:

• The red line is traced by the sequence of numbers $X_t = 3^{k_t}/2^t$. If our null hypothesis $\theta = 1/2$ were true, these numbers would be a realization of a test martingale, but this hypothesis is false (as is our alternative hypothesis $\theta = 3/4$).

• The upper dotted line is the running supremum of the $X_t$:

\[ X^*_t = \max_{n=1,\ldots,t} \frac{3^{k_n}}{2^n} = (\text{best evidence so far against } \theta = 1/2)_t. \]

• The lower dotted line, which we will call $F_t$, shrinks this best evidence using our martingale calibrator: $F_t = 0.1\,(X^*_t)^{0.9}$.

• The blue line, which we will call $Y_t$, is a test martingale under the null hypothesis that satisfies (15): $Y_t \ge F_t$.

According to the proof of Theorem 4, $E(0.1(X^*_\infty)^{0.9} \mid \mathcal{F}_t)/E(0.1(X^*_\infty)^{0.9})$, where the expected values are with respect to $P_{1/2}$ and $\mathcal{F}_t$ is the σ-algebra generated by the first $t$ observations, is a test martingale that satisfies (15). Because these expected values may be difficult to compute, we have used in its stead in the role of $Y_t$ a more easily computed test martingale that is shown in [12] to satisfy (15).
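The construction of the red and dotted lines can be sketched as follows. This is a hedged illustration, not the code used for Figure 2: the function name, the seed and the use of Python's `random` module are our assumptions, so the numbers it produces will differ from those reported for the figure. The test martingale $Y_t$ from [12] is not computed here. Everything is kept on the natural-log scale to avoid overflow.

```python
import math
import random

def simulate_calibration(t_max=10_000, theta=math.log(2) / math.log(3),
                         alpha=0.1, seed=0):
    """Return a list of (ln X_t, ln X*_t, ln F_t) for t = 1..t_max.

    X_t is the likelihood ratio (13), X*_t its running maximum, and
    F_t = alpha * (X*_t)**(1 - alpha) the calibrated value as in (14).
    """
    rng = random.Random(seed)
    log_x = 0.0
    log_sup = -math.inf
    path = []
    for _ in range(t_max):
        x = 1 if rng.random() < theta else 0
        # X_t is multiplied by 3/2 on a 1 and by 1/2 on a 0.
        log_x += math.log(1.5) if x else math.log(0.5)
        log_sup = max(log_sup, log_x)
        log_f = math.log(alpha) + (1.0 - alpha) * log_sup
        path.append((log_x, log_sup, log_f))
    return path
```

Plotting the three components of `simulate_calibration()` against $t$ on a logarithmic vertical axis gives curves of the same kind as the red and dotted lines described above.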

Here are the final values of the processes shown in Figure 2:

\[ X_{10,000} = 2.2, \quad X^*_{10,000} = 7.3 \times 10^{15}, \quad F_{10,000} = 1.9 \times 10^{13}, \quad Y_{10,000} = 2.2 \times 10^{13}. \]

The test martingale $Y_t$ legitimately and correctly rejects the null hypothesis at time 10,000 on the basis of $X_t$'s high earlier values, even though the Bayes factor $X_{10,000}$ is not high. The Bayes factor $Y_{10,000}$ gives overwhelming evidence against the null hypothesis, even though it is more than two orders of magnitude smaller than $X^*_{10,000}$.

Fig. 2. The red line is a realization over 10,000 trials of the likelihood ratio for testing $\theta = 1/2$ against $\theta = 3/4$. The horizontal axis gives the number of observations so far. The vertical axis is logarithmic and is labeled by powers of 10. The likelihood ratio varies wildly, up to $10^{15}$ and down to $10^{-15}$. Were the sequence continued indefinitely, it would be unbounded in both directions.

As the reader will have noticed, the test martingale $X_t$'s overwhelming values against $\theta = 1/2$ in Figure 2 are followed, around $t = 7{,}000$, by overwhelming values (order of magnitude $10^{-15}$) against $\theta = 3/4$. Had we been testing $\theta = 3/4$ against $\theta = 1/2$, we would have found that it can also be rejected very strongly even after calibration. The fact that $(X_t)$ and $(1/X_t)$ both have times when they are very large is not accidental when we sample from $P_{\ln 2/\ln 3}$. Under this measure, the conditional expected value of the increment $\ln X_t - \ln X_{t-1}$, given the first $t-1$ observations, is

\[ \frac{\ln 2}{\ln 3}\ln\frac{3}{2} + \left(1 - \frac{\ln 2}{\ln 3}\right)\ln\frac{1}{2} = 0. \]

So $\ln X_t$ is a martingale under $P_{\ln 2/\ln 3}$. The conditional variance of its increment is

\[ \frac{\ln 2}{\ln 3}\left(\ln\frac{3}{2}\right)^2 + \left(1 - \frac{\ln 2}{\ln 3}\right)\left(\ln\frac{1}{2}\right)^2 = \ln 2\,\ln\frac{3}{2}. \]

By the law of the iterated logarithm,

\[ \limsup_{t\to\infty} \frac{\ln X_t}{\sqrt{2\ln 2\,\ln(3/2)\,t\ln\ln t}} = 1 \quad\text{and}\quad \liminf_{t\to\infty} \frac{\ln X_t}{\sqrt{2\ln 2\,\ln(3/2)\,t\ln\ln t}} = -1 \]

almost surely. This means that as $t$ tends to $\infty$, $\ln X_t$ oscillates between approximately $\pm 0.75\sqrt{t\ln\ln t}$; in particular,

\[ \limsup_{t\to\infty} X_t = \infty \quad\text{and}\quad \liminf_{t\to\infty} X_t = 0 \tag{16} \]

almost surely. This guarantees that we will eventually obtain overwhelming evidence against whichever of the hypotheses $\theta = 1/2$ and $\theta = 3/4$ we want to reject. This may be called sampling to a foregone conclusion, but the foregone conclusion will be correct, since both $\theta = 1/2$ and $\theta = 3/4$ are wrong.
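The two moments used in the law of the iterated logarithm above are easy to verify numerically. The following sketch (with a function name of our choosing) computes the mean and variance of the increment of $\ln X_t$ when each observation equals 1 with probability `theta`.

```python
import math

def increment_moments(theta):
    """Mean and variance of ln X_t - ln X_{t-1} when P(x_t = 1) = theta.

    The increment is ln(3/2) on a 1 and ln(1/2) on a 0, by (13).
    """
    up, down = math.log(1.5), math.log(0.5)
    mean = theta * up + (1.0 - theta) * down
    var = theta * up ** 2 + (1.0 - theta) * down ** 2 - mean ** 2
    return mean, var

mean, var = increment_moments(math.log(2) / math.log(3))
lil_constant = math.sqrt(2.0 * var)   # coefficient of sqrt(t ln ln t)
```

With $\theta = \ln 2/\ln 3$ the mean vanishes, the variance equals $\ln 2\,\ln(3/2)$, and `lil_constant` is approximately 0.75.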

In order to obtain (16), we chose $x_1,\ldots,x_{10,000}$ from a probability distribution, $P_{\ln 2/\ln 3}$, that lies midway between $P_{1/2}$ and $P_{3/4}$ in the sense that it tends to produce sequences that are as atypical with respect to the one measure as to the other. Had we chosen a sequence $x_1,\ldots,x_{10,000}$ less atypical with respect to $P_{3/4}$ than with respect to $P_{1/2}$, then we might have been able to sample to the foregone conclusion of rejecting $\theta = 1/2$, but not to the foregone conclusion of rejecting $\theta = 3/4$.

8.2 Testing $\theta = 1/2$ Against a Composite Alternative

Retaining $\theta = 1/2$ as our null hypothesis, we now take as our alternative hypothesis the probability distribution $Q$ obtained by averaging $P_\theta$ with respect to the uniform distribution for $\theta$.

After we observe $x_1,\ldots,x_t$, the likelihood ratio for testing $P_{1/2}$ against $Q$ is

\[ X_t := \frac{Q(x_1,\ldots,x_t)}{P_{1/2}(x_1,\ldots,x_t)} = \frac{\int_0^1 \theta^{k_t}(1-\theta)^{t-k_t}\,d\theta}{(1/2)^t} = \frac{k_t!\,(t-k_t)!\,2^t}{(t+1)!}. \tag{17} \]
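For long sequences the factorials in (17) overflow ordinary floating-point arithmetic, so in practice one would work with logarithms. The sketch below is one way to do this with the standard-library function `math.lgamma`; the name `log_mixture_ratio` is ours.

```python
import math

def log_mixture_ratio(k, t):
    """ln X_t for (17): X_t = k! (t - k)! 2**t / (t + 1)!.

    Uses lgamma(n + 1) = ln n! so that t in the thousands poses no problem.
    """
    if not 0 <= k <= t:
        raise ValueError("need 0 <= k <= t")
    return (math.lgamma(k + 1) + math.lgamma(t - k + 1)
            + t * math.log(2) - math.lgamma(t + 2))
```

As a sanity check, averaging $X_{t+1}$ over the two equally likely next outcomes under $\theta = 1/2$ returns $X_t$, which is the martingale property.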

Figure 3 shows an example of this process and of the application of the same martingale calibrator, (14), that we used in Figure 2. In this case, we generate the 0s and 1s in the sequence $x_1,\ldots,x_{10,000}$ independently but with a probability for $x_t = 1$ that slowly converges to 1/2: $\frac12 + \frac14\sqrt{\ln t/t}$. As we show in Appendix B, (16) again holds almost surely; if you wait long enough, you will have enough evidence to reject legitimately whichever of the two false hypotheses (independently and identically distributed with $\theta = 1/2$, or independently and identically distributed with $\theta \ne 1/2$) you want.

Fig. 3. A realization over 10,000 trials of the likelihood ratio for testing $\theta = 1/2$ against the probability distribution $Q$ obtained by averaging $P_\theta$ with respect to the uniform distribution for $\theta$. The vertical axis is again logarithmic. As in Figure 2, the oscillations would be unbounded if trials continued indefinitely.

Here are the final values of the processes shown in Figure 3:

\[ X_{10,000} = 3.5, \quad X^*_{10,000} = 3599, \quad F_{10,000} = 159, \quad Y_{10,000} = 166. \]

In this case, the evidence against $\theta = 1/2$ is very substantial but not overwhelming.

8.3 p-Values for Testing $\theta = 1/2$

By Theorem 2, $1/X^*_\infty$ is a p-test whenever $(X_t)$ is a test martingale. Applying this to the test martingale (17) for testing $P_{1/2}$ against $Q$, we see that

\[ p(x_1,x_2,\ldots) := \frac{1}{\sup_{1\le t<\infty} \bigl(k_t!\,(t-k_t)!\,2^t/(t+1)!\bigr)} = \inf_{1\le t<\infty} \frac{(t+1)!}{k_t!\,(t-k_t)!\,2^t} \tag{18} \]

is a p-test for testing $\theta = 1/2$ against $\theta \ne 1/2$. Figure 4 shows that it is only moderately conservative.

Any function of the observations that is bounded below by a p-test is also a p-test. So for any rule $N$ for selecting a positive integer $N(x_1,x_2,\ldots)$ based on knowledge of some or all of the observations $x_1,x_2,\ldots$, the function

\[ r_N(x_1,x_2,\ldots) := \frac{(N+1)!}{k_N!\,(N-k_N)!\,2^N} \tag{19} \]

is a p-test. It does not matter whether $N$ qualifies as a stopping rule [i.e., whether $x_1,\ldots,x_n$ always determine whether $N(x_1,x_2,\ldots) \le n$].

For each positive integer $n$, let

\[ p_n := \frac{(n+1)!}{k_n!\,(n-k_n)!\,2^n}. \tag{20} \]

We can paraphrase the preceding paragraph by saying that $p_n$ is a p-value (i.e., the value of a p-test) no matter what rule is used to select $n$. In particular, it is a p-value even if it was selected because it was the smallest number in the sequence $p_1, p_2, \ldots, p_n, \ldots, p_t$, where $t$ is an integer much larger than $n$.
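To make (20) concrete, the sketch below computes $p_1, p_2, \ldots$ exactly for a short sequence, together with the running minimum, which is the finite-horizon analogue of the infimum in (18). It uses exact rational arithmetic and is meant for short illustrative sequences only; the function name is ours.

```python
from fractions import Fraction
from math import factorial

def p_values(xs):
    """Return ([p_1, ..., p_t], [min(p_1..p_n) for n = 1..t]) for binary xs.

    p_n = (n + 1)! / (k_n! (n - k_n)! 2**n), as in (20).
    """
    ps, running_min = [], []
    k = 0
    for n, x in enumerate(xs, start=1):
        k += x                                    # k_n, number of 1s so far
        p = Fraction(factorial(n + 1),
                     factorial(k) * factorial(n - k) * 2 ** n)
        ps.append(p)
        running_min.append(p if not running_min else min(running_min[-1], p))
    return ps, running_min
```

For four 1s in a row this gives $p_4 = 5/16$; note that individual $p_n$ can exceed 1 early in a balanced sequence, in which case they carry no evidence against the null hypothesis.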

We must nevertheless be cautious if we do not know the rule $N$: if the experimenter who does the sampling reports to us $p_n$ and perhaps some other information but not the rule $N$. We can consider the reported value of $p_n$ a legitimate p-value whenever we know that the experimenter would have told us $p_n$ for some $n$, even if we do not know what rule $N$ he followed to choose $n$ and even if he did not follow any clear rule. But we should not think of $p_n$ as a p-value if it is possible that the experimenter would not have reported anything at all had he not found an $n$ with a $p_n$ to his liking. We are performing a p-test only if we learn the result no matter what it is.

Continuing to sample in search of evidence against $\theta = 1/2$ and stopping only when the p-value finally reaches 5% can be considered legitimate if instead of using conventional p-tests for fixed sample sizes we use the p-test (19) with $N$ defined by

\[ N(x_1,x_2,\ldots) := \inf\left\{ n : \frac{(n+1)!}{k_n!\,(n-k_n)!\,2^n} \le 0.05 \right\}. \]

But we must bear in mind that $N(x_1,x_2,\ldots)$ may take the value $\infty$. If the experimenter stops only when the p-value dips down to the 5% level, he has a chance of at least 95%, under the null hypothesis, of never stopping. So it will be legitimate to interpret a reported $p_n$ of 0.05 or less as a p-value (the observed value of a p-test) only if we were somehow also guaranteed to hear about the failure to stop.

Fig. 5. The ratio (23) as $n$ ranges from 100 to 10,000. This is the factor by which not knowing $n$ in advance widens the 99% prediction interval for $k_n$. Asymptotically, the ratio tends to infinity with $n$ as $c\sqrt{\ln n}$ for some positive constant $c$.

8.4 Comparison with a Standard p-Test 

If the number $n$ of observations is known in advance, a standard sampling-theory procedure for testing the hypothesis $\theta = 1/2$ is to reject it if $|k_n - n/2| \ge c_{n,\delta}$, where $c_{n,\delta}$ is chosen so that $P_{1/2}\{|k_n - n/2| \ge c_{n,\delta}\}$ is equal (or less than but as close as possible) to a chosen significance level $\delta$. To see how this compares with the p-value $p_n$ given by (20), let us compare the conditions for nonrejection:

• If we use the standard procedure, the condition for not rejecting $\theta = 1/2$ at level $\delta$ is

\[ |k_n - n/2| < c_{n,\delta}. \tag{21} \]

• If we use the p-value $p_n$, the condition for not rejecting $\theta = 1/2$ at level $\delta$ is $p_n > \delta$, or

\[ \frac{(n+1)!}{k_n!\,(n-k_n)!\,2^n} > \delta. \tag{22} \]

In both cases, $k_n$ satisfies the condition with probability at least $1 - \delta$ under the null hypothesis, and, hence, the condition defines a level $1 - \delta$ prediction interval for $k_n$. Because condition (21) requires the value of $n$ to be known in advance and condition (22) does not, we can expect the prediction interval defined by (22) to be wider than the one determined by (21). How much wider?

Figure 5 answers this question for the case where $\delta = 0.01$ and $100 \le n \le 10{,}000$. It shows, for each value of $n$ in this range, the ratio

\[ \frac{\text{width of the 99\% prediction interval given by (22)}}{\text{width of the 99\% prediction interval given by (21)}}, \tag{23} \]

that is, the factor by which not knowing $n$ in advance widens the prediction interval. The factor is less than 2 over the whole range but increases steadily with $n$.
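For a single moderate $n$ the two prediction intervals and the ratio (23) can be computed exactly. The sketch below is an illustration under our own conventions, not the code behind Figure 5: the function names are ours, and "width" is taken to be the difference between the largest and smallest nonrejected value of $k_n$, which is one of several reasonable conventions.

```python
from fractions import Fraction
from math import comb

def standard_interval(n, delta):
    """Nonrejection interval (21): largest symmetric rejection region
    {k < lo} U {k > hi} whose probability under theta = 1/2 is <= delta."""
    pmf = [Fraction(comb(n, k), 2 ** n) for k in range(n + 1)]
    lo, hi, rejected = 0, n, Fraction(0)
    while lo < hi and rejected + pmf[lo] + pmf[hi] <= delta:
        rejected += pmf[lo] + pmf[hi]
        lo, hi = lo + 1, hi - 1
    return lo, hi

def calibrated_interval(n, delta):
    """Nonrejection interval (22): all k with (n + 1) C(n, k) / 2**n > delta,
    using (n + 1)! / (k! (n - k)!) = (n + 1) C(n, k)."""
    ks = [k for k in range(n + 1)
          if Fraction((n + 1) * comb(n, k), 2 ** n) > delta]
    return ks[0], ks[-1]

def width_ratio(n, delta):
    """Ratio (23) with width = largest minus smallest nonrejected k."""
    s_lo, s_hi = standard_interval(n, delta)
    c_lo, c_hi = calibrated_interval(n, delta)
    return Fraction(c_hi - c_lo, s_hi - s_lo)
```

Calling `width_ratio(100, Fraction(1, 100))` gives a value between 1 and 2, in line with the left end of the range described above.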

As $n$ increases further, the factor by which the standard interval is multiplied increases without limit, but very slowly. To verify this, we first rewrite (22) as

\[ |k_n - n/2| < (1+\alpha_n)\sqrt{n}\,\sqrt{\frac12\ln\frac{1}{\delta} + \frac14\ln n}, \tag{24} \]

where $\alpha_n$ is a sequence such that $\alpha_n \to 0$ as $n \to \infty$. [For some $\alpha_n$ of order $o(1)$ the inequality (24) is stronger than $p_n > \delta$, whereas for others it is weaker; see Appendix B for details of calculations.] Then, using the Berry-Esseen theorem and letting $z_\varepsilon$ stand for the upper $\varepsilon$-quantile of the standard Gaussian distribution, we rewrite (21) as

\[ |k_n - n/2| < \frac12\, z_{\delta/2+\alpha_n}\sqrt{n}, \tag{25} \]

where $\alpha_n$ is a sequence such that $|\alpha_n| \le (2\pi)^{-1/2} n^{-1/2}$ for all $n$. (See [17].) As $\delta \to 0$,

\[ z_{\delta/2} \sim \sqrt{2\ln\frac{2}{\delta}} \sim \sqrt{2\ln\frac{1}{\delta}}. \]

So the main asymptotic difference between (24) and (25) is the presence of the term $\frac14\ln n$ in (24).

The ratio (23) tends to infinity with $n$ as $c\sqrt{\ln n}$ for a positive constant $c$ (namely, for $c = 1/z_{\delta/2}$, where $\delta = 0.01$ is the chosen significance level). However, the expression on the right-hand side of (24) results from using the uniform probability measure on $\theta$ to average the probability measures $P_\theta$. Averaging with respect to a different probability measure would give something different, but it is clear from the law of the iterated logarithm that the best we can get is a prediction interval whose ratio with the standard interval will grow like $\sqrt{\ln\ln n}$ instead of $\sqrt{\ln n}$. In fact, the method we just used to obtain (24) was used by Ville, with a more carefully chosen probability measure on $\theta$, to prove the upper half of the law of the iterated logarithm ([39], Section V.3), and Ville's argument was rediscovered and simplified using the algorithmic theory of randomness in [40], Theorem 1.

8.5 Testing a Composite Hypothesis Against 
a Composite Hypothesis 

When Peter Armitage pointed out that even Bayesians can sample to a foregone conclusion, he used as an example the Gaussian model with known variance and unknown mean [6]. We can adapt Armitage's idea to coin tossing by comparing two composite hypotheses: the null hypothesis $\theta \le 1/2$, represented by the uniform probability measure on $[0, 1/2]$, and the alternative hypothesis $\theta > 1/2$, represented by the uniform probability measure on $(1/2, 1]$. (These hypotheses are natural in the context of paired comparison: see, e.g., [23], Section 3.1.) The test martingale is

\[ X_t = \frac{2\int_{1/2}^{1} \theta^{k_t}(1-\theta)^{t-k_t}\,d\theta}{2\int_{0}^{1/2} \theta^{k_t}(1-\theta)^{t-k_t}\,d\theta} = \frac{P\{B_{t+1} \le k_t\}}{P\{B_{t+1} \ge k_t + 1\}}, \tag{26} \]

where $B_n$ is the binomial random variable with parameters $n$ and 1/2; see Appendix B for details. If the sequence $x_1, x_2, \ldots$ turns out to be typical of $\theta = 1/2$, then by the law of the iterated logarithm, $(k_t - t/2)/\sqrt{t}$ will almost surely have $\infty$ as its upper limit and $-\infty$ as its lower limit; therefore, (16) will hold again. This confirms Armitage's intuition that arbitrarily strong evidence on both sides will emerge if we wait long enough, but the oscillation depends on increasingly extreme reversals of a random walk, and the lifetime of the universe may not be long enough for us to see any of them [$\sqrt{\ln\ln(5 \times 10^{23})} < 2$].
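The binomial form of (26) is straightforward to evaluate exactly. The sketch below uses exact rational arithmetic; the function name `armitage_ratio` is ours and the code is only an illustration of the formula.

```python
from fractions import Fraction
from math import comb

def armitage_ratio(k, t):
    """X_t from (26): P{B_{t+1} <= k} / P{B_{t+1} >= k + 1},
    where B_{t+1} is binomial with parameters t + 1 and 1/2."""
    if not 0 <= k <= t:
        raise ValueError("need 0 <= k <= t")
    n = t + 1
    upper = sum(comb(n, j) for j in range(k + 1, n + 1))  # 2**n * P{B >= k+1}
    lower = 2 ** n - upper                                # 2**n * P{B <= k}
    return Fraction(lower, upper)
```

For one observation, `armitage_ratio(1, 1)` is 3 and `armitage_ratio(0, 1)` is 1/3, which agrees with integrating $\theta$ or $1-\theta$ over the two half-intervals directly; since the null predictive probability of a first 1 is 1/4, the expected value of $X_1$ is $(1/4)\cdot 3 + (3/4)\cdot(1/3) = 1$.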

Figure 6 depicts one example, for which the final values are

\[ X_{10,000} = 3.7, \quad X^*_{10,000} = 272, \quad F_{10,000} = 15.5, \quad Y_{10,000} = 17.9. \]

In this realization, the first 10,000 observations provide modest evidence against $\theta \le 1/2$ and none against $\theta > 1/2$. Figures 2 and 3 are reasonably typical for their setups, but in this setup it is unusual for the first 10,000 observations to show even as much evidence against one of the hypotheses as we see in Figure 6.

8.6 A Puzzle for Bayesians 

From a Bayesian point of view, it may seem puz- 
zling that we should want to shrink a likelihood ratio 
in order to avoid exaggerating the evidence against 
a null hypothesis. Observations affect Bayesian pos- 
terior odds only through the likelihood ratio, and we 
know that the likelihood ratio is not affected by the 
sampling plan. So why should we adjust it to take 
the sampling plan into account? 

Suppose we assign equal prior probabilities of 1/2 each to the two hypotheses $\theta = 1/2$ and $\theta = 3/4$ in our first coin-tossing example. Then if we stop at time $t$, the likelihood ratio $X_t$ given by (13) is identical with the posterior odds in favor of $\theta = 3/4$. If we write $\mathrm{post}_t$ for the posterior probability measure at time $t$, then

\[ X_t = \frac{\mathrm{post}_t\{\theta = 3/4\}}{\mathrm{post}_t\{\theta = 1/2\}} = \frac{1 - \mathrm{post}_t\{\theta = 1/2\}}{\mathrm{post}_t\{\theta = 1/2\}} \]

and

\[ \mathrm{post}_t\{\theta = 1/2\} = \frac{1}{X_t + 1}. \tag{27} \]

This is our posterior probability given the evidence $x_1,\ldots,x_t$ no matter why we decided to stop at time $t$.
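Formula (27) is a one-line computation; the sketch below (function name ours) is included only to make the dependence on $X_t$ explicit, assuming the equal prior probabilities stated above.

```python
def posterior_null(x_t):
    """post_t{theta = 1/2} = 1 / (X_t + 1), as in (27).

    Assumes prior probability 1/2 on each of the two simple hypotheses.
    """
    if x_t < 0:
        raise ValueError("a likelihood ratio cannot be negative")
    return 1.0 / (x_t + 1.0)
```

With the final value $X_{10,000} = 2.2$ reported for Figure 2, this gives a posterior probability of 0.3125 for $\theta = 1/2$.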






Fig. 6. A realization over 10,000 trials of the likelihood ratio for testing the probability distribution obtained by averaging $P_\theta$ with respect to the uniform probability measure on $[0,1/2]$ against the probability distribution obtained by averaging $P_\theta$ with respect to the uniform probability measure on $(1/2,1]$. As in the previous figures, the vertical axis is logarithmic, and the red line would be unbounded in both directions if observations continued indefinitely.



If we "calibrate" Xt and plug the calibrated value 
instead of the actual value into (27), we will get the 
posterior probability wrong. 

It may help us escape from our puzzlement to acknowledge that if the model is wrong, then the observations may oscillate between providing overwhelming evidence against $\theta = 1/2$ and providing overwhelming evidence against $\theta = 3/4$, as in Figure 2. Only if we insist on retaining the model in spite of this very anomalous phenomenon will (27) continue to be our posterior probability for $\theta = 1/2$ at time $t$, and it is this stubbornness that opens the door to sampling to whichever foregone conclusion we want, $\theta = 1/2$ or $\theta = 3/4$.

The same issues arise when we test $\theta = 1/2$ against the composite hypothesis $\theta \ne 1/2$. A natural Bayesian method for doing this is to put half our probability on $\theta = 1/2$ and distribute the other half uniformly on $[0,1]$ (which is a special case of a widely recommended procedure described in, e.g., [7], page 391). This makes the likelihood ratio $X_t$ given by (17) the posterior odds against $\theta = 1/2$. As we have seen, if the observations $x_1, x_2, \ldots$ turn out to be typical for the distribution in which they are independent with the probability for $x_t = 1$ equal to $\frac12 + \frac14\sqrt{\ln t/t}$, then if you wait long enough, you can observe values of $X_t$ as small or as large as you like, and thus obtain a posterior probability for $\theta = 1/2$ as large or as small as you like.



Of course, it will not always happen that the ac- 
tual observations are so equidistant from a simple 
null hypothesis and the probability distribution rep- 
resenting its negation that the likelihood ratio will 
oscillate wildly and you can sample to whichever side 
you want. More often, the likelihood ratio and hence 
the posterior probability will settle on one side or the 
other. But in the spirit of George Box's maxim that 
all models are wrong, we can interpret this not as 
confirmation of the side favored but only as confir- 
mation that the other side should be rejected. The 
rejection will be legitimate from the Bayesian point 
of view, regardless of why we stopped sampling. 
It will also be legitimate from the sampling-theory 
point of view. 

On this argument, it is legitimate to collect data 
until a point has been disproven but not legitimate 
to interpret this data as proof of an alternative hy- 
pothesis within the model. Only when we really know 
the model is correct can we prove one of its hypothe- 
ses by rejecting the others. 

APPENDIX A: INADEQUACY OF TEST 
MARTINGALES IN CONTINUOUS TIME 

In this appendix we will mainly discuss the case of continuous time; we will see that in this case the notion of a test martingale is not fully adequate for the purpose of hypothesis testing (Proposition 2). Fix a filtration $(\mathcal{F}_t)$ satisfying the usual conditions; in this appendix we will only consider supermartingales $(X_t, \mathcal{F}_t)$, and we will abbreviate $(X_t, \mathcal{F}_t)$ to $(X_t)$, or even to $X_t$ or $X$.

In discrete time, there is no difference between 
using test martingales and test supermartingales for 
hypothesis testing: every test martingale is a test 
supermartingale, and every test supermartingale is 
dominated by a test martingale (according to Doob's 
decomposition theorem, [30], VII. 1); therefore, us- 
ing test supermartingales only allows discarding ev- 
idence as compared to test martingales. In contin- 
uous time, the difference between test martingales 
and test supermartingales is essential, as we will 
see below (Proposition 2). For hypothesis testing we 
need "local martingales," a modification of the no- 
tion of martingales introduced by Ito and Watan- 
abe [18] and nowadays used perhaps even more of- 
ten than martingales themselves in continuous time. 
This is the principal reason why in this article we 
use test supermartingales so often starting from Sec- 
tion 3. 

We will say that a random process $(X_t)$ is a local member of a class $\mathcal{C}$ of random processes (such as martingales or supermartingales) if there exists a sequence $\tau_1 \le \tau_2 \le \cdots$ of stopping times (called a localizing sequence) such that $\tau_n \to \infty$ a.s. and each stopped process $X^{\tau_n}_t = X_{t\wedge\tau_n}$ belongs to the class $\mathcal{C}$. (A popular alternative definition requires that each $X_{t\wedge\tau_n} 1_{\{\tau_n > 0\}}$ should belong to $\mathcal{C}$.) A standard argument (see, e.g., [13], VI.29) shows that there is no difference between test supermartingales and local test supermartingales:

Proposition 1. Every local test supermartin- 
gale (Xt) is a test supermartingale. 

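Proof. Let $\tau_1, \tau_2, \ldots$ be a localizing sequence, so that $\tau_n \to \infty$ as $n \to \infty$ a.s. and each $X^{\tau_n}$, $n = 1, 2, \ldots$, is a test supermartingale. By Fatou's lemma for conditional expectations, we have, for $0 \le s < t$,

\[ E(X_t \mid \mathcal{F}_s) = E\Bigl(\lim_{n\to\infty} X^{\tau_n}_t \Bigm| \mathcal{F}_s\Bigr) \le \liminf_{n\to\infty} E(X^{\tau_n}_t \mid \mathcal{F}_s) \le \liminf_{n\to\infty} X^{\tau_n}_s = X_s \quad\text{a.s.} \]

In particular, $E(X_t) \le 1$. □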

An adapted process $(A_t)$ is called increasing if $A_0 = 0$ a.s. and its every path is right-continuous and increasing (as usual, not necessarily strictly increasing). According to the Doob-Meyer decomposition theorem ([13], Theorem VII.12), every test supermartingale $(X_t)$ can be represented as the difference $X_t = Y_t - A_t$ of a local test martingale $(Y_t)$ and an increasing process $(A_t)$. Therefore, for the purpose of hypothesis testing in continuous time, local test martingales are as powerful as test supermartingales: every local test martingale is a test supermartingale, and every test supermartingale is dominated by a local test martingale.

In discrete time there is no difference between local test martingales and test martingales ([13], (VI.31.1)). In continuous time, however, the difference is essential. Suppose the filtration $(\mathcal{F}_t)$ admits a standard Brownian motion $(W_t, \mathcal{F}_t)$ in $\mathbb{R}^3$. A well-known example ([19]; see also [30], VI.21, and [13], VI.26) of a local martingale which is not a martingale is $L_t := 1/\|W_t + e\|$, where $e$ is a vector in $\mathbb{R}^3$ such that $\|e\| = 1$ [e.g., $e = (1,0,0)$]; $L_t$ being a local martingale can be deduced from $1/\|\cdot\|$ (the Newtonian kernel) being a harmonic function on $\mathbb{R}^3 \setminus \{0\}$. The random process $(L_t)$ is a local test martingale such that $\sup_t E(L_t^2) < \infty$; nevertheless, it fails to be a martingale. See, for example, [29] (Example 1.140) for detailed calculations.

The local martingale $L_t := 1/\|W_t + e\|$ provides an example of a test supermartingale which cannot be replaced, for the purpose of hypothesis testing, by a test martingale. According to another version of the Doob-Meyer decomposition theorem ([30], VII.31), a supermartingale $(X_t)$ can be represented as the difference $X_t = Y_t - A_t$ of a martingale $(Y_t)$ and an increasing process $(A_t)$ if and only if $(X_t)$ belongs to the class (DL). The latter is defined as follows: a supermartingale is said to be in (DL) if, for any $a > 0$, the system of random variables $X_\tau$, where $\tau$ ranges over the stopping times satisfying $\tau \le a$, is uniformly integrable. It is known that $(L_t)$, despite being uniformly integrable (as a collection of random variables $L_t$), does not belong to the class (DL) ([30], VI.21 and the note in VI.19). Therefore, $(L_t)$ cannot be represented as the difference $L_t = Y_t - A_t$ of a martingale $(Y_t)$ and an increasing process $(A_t)$. Test martingales cannot replace local test martingales in hypothesis testing also in the stronger sense of the following proposition.

Proposition 2. Let $\delta > 0$. It is not true that for every local test martingale $(X_t)$ there exists a test martingale $(Y_t)$ such that $Y_t \ge \delta X_t$ a.s. for all $t$.

Proof. Let $X_t := L_t = 1/\|W_t + e\|$, and suppose there is a test martingale $(Y_t)$ such that $Y_t \ge \delta X_t$ a.s. for all $t$. Let $\varepsilon > 0$ be arbitrarily small. Since $(Y_t)$ is in (DL) ([30], VI.19(a)), for any $a > 0$ we can find $C > 0$ such that

\[ \sup_\tau \int_{\{Y_\tau > C\}} Y_\tau\,dP < \varepsilon\delta, \]

$\tau$ ranging over the stopping times satisfying $\tau \le a$. Since

\[ \sup_\tau \int_{\{X_\tau > C/\delta\}} X_\tau\,dP \le \sup_\tau \int_{\{Y_\tau > C\}} (Y_\tau/\delta)\,dP < \varepsilon, \]

$(X_t)$ is also in (DL), which we know to be false. □

APPENDIX B: DETAILS OF CALCULATIONS 

In this appendix we will give details of some calculations omitted in Section 8. They will be based on Stirling's formula $n! = \sqrt{2\pi n}\,(n/e)^n e^{\lambda_n}$, where $\lambda_n = o(1)$ as $n \to \infty$.

B.l Oscillating Evidence when Testing Against 
a Composite Alternative 

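First we establish (16) for $X_t$ defined by (17). Suppose we have made $t$ observations and observed $k := k_t$ 1s so far. We start from finding bounds on $k$ that are implied by the law of the iterated logarithm. Using the simplest version of Euler's summation formula (as in [5], Theorem 1), we can find its expected value as

\[ E(k) = \sum_{n=1}^{t}\left(\frac12 + \frac14\sqrt{\frac{\ln n}{n}}\right) = \frac{t}{2} + \frac14\sum_{n=2}^{t}\sqrt{\frac{\ln n}{n}} = \frac12\bigl(t + \sqrt{t\ln t}\bigr) + O\!\left(\sqrt{\frac{t}{\ln t}}\right). \]

Its variance is

\[ \operatorname{var}(k) = \sum_{n=1}^{t}\left(\frac12 + \frac14\sqrt{\frac{\ln n}{n}}\right)\left(\frac12 - \frac14\sqrt{\frac{\ln n}{n}}\right) = \sum_{n=1}^{t}\left(\frac14 - \frac{1}{16}\frac{\ln n}{n}\right) \sim \frac{t}{4}. \]

Therefore, Kolmogorov's law of the iterated logarithm gives

\[ \limsup_{t\to\infty} \frac{k - (1/2)\bigl(t + \sqrt{t\ln t}\bigr)}{\sqrt{(1/2)\,t\ln\ln t}} = 1 \quad\text{and}\quad \liminf_{t\to\infty} \frac{k - (1/2)\bigl(t + \sqrt{t\ln t}\bigr)}{\sqrt{(1/2)\,t\ln\ln t}} = -1 \quad\text{a.s.} \tag{28} \]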

Using the definition (17) and applying Stirling's 
formula, we obtain 

$$\begin{aligned}
\ln X_t &= t \ln 2 + \ln \frac{k!\,(t-k)!}{t!} - \ln(t+1) \\
&= t \ln 2 - t H(k/t) + \ln \sqrt{2\pi \frac{k(t-k)}{t}} + \lambda_k + \lambda_{t-k} - \lambda_t - \ln(t+1) \\
&= t \left( \ln 2 - H(k/t) \right) - \frac12 \ln t + O(1) \\
&= 2t \left( \frac{k}{t} - \frac12 \right)^2 - \frac12 \ln t + O(1) \quad \text{a.s.,}
\end{aligned} \tag{29}$$

where $H(p) := -p \ln p - (1-p) \ln(1-p)$, $p \in [0,1]$, is
the entropy function; the last equality in (29) uses
$\ln 2 - H(p) = 2(p - 1/2)^2 + O(|p - 1/2|^3)$ as $p \to 1/2$.
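
The quality of the approximation in (29) can be illustrated numerically. The following is a minimal sketch, assuming the first line of (29) as the exact value of $\ln X_t$ and using `math.lgamma` for log-factorials; the function names are hypothetical. For $k/t$ close to $1/2$ the $O(1)$ term works out to roughly $(1/2)\ln(\pi/2) \approx 0.226$, since $\ln\sqrt{2\pi k(t-k)/t} - \ln(t+1) + (1/2)\ln t \to (1/2)\ln(\pi/2)$.

```python
import math

def log_x(t: int, k: int) -> float:
    """Exact ln X_t = t ln 2 + ln(k!(t-k)!/t!) - ln(t+1), first line of (29)."""
    return (t * math.log(2) + math.lgamma(k + 1) + math.lgamma(t - k + 1)
            - math.lgamma(t + 1) - math.log(t + 1))

def log_x_approx(t: int, k: int) -> float:
    """Leading terms 2t(k/t - 1/2)^2 - (1/2) ln t from the last line of (29)."""
    return 2 * t * (k / t - 0.5) ** 2 - 0.5 * math.log(t)

t = 10**6
k = t // 2 + 1858  # k - t/2 is about (1/2) sqrt(t ln t), the typical size in (28)
print(log_x(t, k) - log_x_approx(t, k), 0.5 * math.log(math.pi / 2))
```

The two printed numbers should nearly coincide, so the remainder in (29) is indeed bounded in this regime.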
Combining (29) with (28), we further obtain 



$$\limsup_{t\to\infty} \frac{\ln X_t}{\sqrt{2 \ln t \ln\ln t}} = 1
\quad\text{and}\quad
\liminf_{t\to\infty} \frac{\ln X_t}{\sqrt{2 \ln t \ln\ln t}} = -1
\quad\text{a.s.} \tag{30}$$

B.2 Prediction Interval 

Now we show that (22) can be rewritten as (24). 
For brevity, we write $k$ for $k_n$. Similarly to (29), we
can rewrite (22) as 



$$\ln 2 - H(k/n) + \frac1n \ln \sqrt{2\pi \frac{k(n-k)}{n}} + \frac{\lambda_k + \lambda_{n-k} - \lambda_n}{n} - \frac1n \ln(n+1) < \frac{\ln(1/\delta)}{n}. \tag{31}$$



Since $\ln 2 - H(p) \sim 2(p - 1/2)^2$ ($p \to 1/2$), we have
$k/n = 1/2 + o(1)$ for $k$ satisfying (31), as $n \to \infty$.
Combining this with (31), we further obtain 



$$2 \left( \frac{k}{n} - \frac12 \right)^2 < (1 + \alpha_n) \, \frac{\ln(1/\delta) - \ln \sqrt{n} + \ln(n+1) + \beta_n}{n}$$

for some $\alpha_n = o(1)$ and $\beta_n = O(1)$, which can be
rewritten as (24) for a different sequence $\alpha_n = o(1)$.
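
A rough numerical illustration of this approximation follows. It rests on two assumptions not spelled out in this appendix: that (22) is the condition $X_n < 1/\delta$ with $\ln X_n$ given by the first line of (29), and that dropping $\alpha_n$ and $\beta_n$ gives the half-width $\sqrt{(\ln(1/\delta) + (1/2)\ln n)/(2n)}$ for $k/n$. The function names are hypothetical and $n$ is taken to be even for simplicity.

```python
import math

def log_x(n: int, k: int) -> float:
    """ln X_n = n ln 2 + ln(k!(n-k)!/n!) - ln(n+1), as in the first line of (29)."""
    return (n * math.log(2) + math.lgamma(k + 1) + math.lgamma(n - k + 1)
            - math.lgamma(n + 1) - math.log(n + 1))

def exact_half_width(n: int, delta: float) -> int:
    """Largest m such that X_n < 1/delta for every k in [n/2, n/2 + m] (n even)."""
    threshold = math.log(1 / delta)
    m = 0
    while n // 2 + m + 1 <= n and log_x(n, n // 2 + m + 1) < threshold:
        m += 1
    return m

def approx_half_width(n: int, delta: float) -> float:
    """Approximate half-width of k with the o(1) and O(1) corrections dropped."""
    return n * math.sqrt((math.log(1 / delta) + 0.5 * math.log(n)) / (2 * n))

n, delta = 10_000, 0.05
print(exact_half_width(n, delta), approx_half_width(n, delta))
```

Under these assumptions the two half-widths agree to within a few percent at $n = 10^4$, with the discrepancy coming from the dropped $O(1)$ term.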

B.3 Calculations for Armitage's Example 

Finally, we deduce (26). Using a well-known ex- 
pression ([2], 6.6.4) for the regularized beta function 
$I_p(a,b) := B(p;a,b)/B(a,b)$ and writing $k$ for $k_t$, we
obtain

$$\begin{aligned}
X_t &= \frac{B(k+1, t-k+1) - B(1/2; k+1, t-k+1)}{B(1/2; k+1, t-k+1)} \\
&= \frac{1}{I_{1/2}(k+1, t-k+1)} - 1 \\
&= \frac{1}{P\{B_{t+1} \ge k+1\}} - 1 \\
&= \frac{P\{B_{t+1} \le k\}}{P\{B_{t+1} \ge k+1\}}.
\end{aligned} \tag{32}$$
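
The chain of equalities in (32) can be verified exactly for small $t$. The sketch below is only an illustration: it assumes $B_{t+1}$ denotes a binomial random variable with $t+1$ trials and success probability $1/2$ (which is what the cited identity for $I_{1/2}$ yields), expands the incomplete beta integral binomially, and uses exact rational arithmetic. All function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def incomplete_beta_half(a: int, b: int) -> Fraction:
    """B(1/2; a, b) = integral over [0, 1/2] of p^(a-1) (1-p)^(b-1) dp."""
    half = Fraction(1, 2)
    # Expand (1-p)^(b-1) by the binomial theorem and integrate term by term.
    return sum(Fraction((-1) ** j * comb(b - 1, j), a + j) * half ** (a + j)
               for j in range(b))

def complete_beta(a: int, b: int) -> Fraction:
    """B(a, b) = (a-1)! (b-1)! / (a+b-1)! for positive integers a, b."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def x_from_beta(t: int, k: int) -> Fraction:
    """First expression for X_t in (32)."""
    part = incomplete_beta_half(k + 1, t - k + 1)
    return (complete_beta(k + 1, t - k + 1) - part) / part

def x_from_binomial(t: int, k: int) -> Fraction:
    """Last expression in (32), with B_{t+1} ~ Binomial(t+1, 1/2) (assumed)."""
    n = t + 1
    lower = sum(comb(n, s) for s in range(0, k + 1))      # 2^n * P{B <= k}
    upper = sum(comb(n, s) for s in range(k + 1, n + 1))  # 2^n * P{B >= k+1}
    return Fraction(lower, upper)

print(x_from_beta(10, 3), x_from_binomial(10, 3))
```

Both functions return the same rational number for every $0 \le k \le t$ tried, which matches the first and last expressions in (32).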

As a final remark, let us compare the sizes of oscillation
of the log likelihood ratio $\ln X_t$ that we
have obtained in Section 8 and in this appendix for
our examples of the three kinds of Bayesian hypothesis
testing. When testing a simple null hypothesis
against a simple alternative, $\ln X_t$ oscillated
between approximately $\pm 0.75 \sqrt{t \ln\ln t}$ (as noticed in
Section 8.1). When testing a simple null hypothesis
against a composite alternative, $\ln X_t$ oscillated
between $\pm \sqrt{2 \ln t \ln\ln t}$ [see (30)]. And finally, when
testing a composite null hypothesis against a composite
alternative, we can deduce from (32) that

$$\limsup_{t\to\infty} \frac{\ln X_t}{\ln\ln t} = 1
\quad\text{and}\quad
\liminf_{t\to\infty} \frac{\ln X_t}{\ln\ln t} = -1
\quad\text{a.s.}$$

(details omitted); therefore, $\ln X_t$ oscillates between
$\pm \ln\ln t$. Roughly, the size of oscillations of $\ln X_t$
goes down from $\sqrt{t}$ to $\sqrt{\ln t}$ to $\ln\ln t$. Of course,
these sizes are only examples, but they illustrate
a general tendency.